PQCache: Product Quantization-based KVCache For Long Context LLM Inference
2024 · Hailin Zhang, Xiaodong Ji, Yilin Chen, et al.
Abstract
As the field of Large Language Models (LLMs) continues to evolve, the context length in inference is steadily growing. Key-Value Cache (KVCache), the intermediate representations of tokens within LLM inference, has now become the primary memory bottleneck due to limited GPU memory. Current methods selectively determine suitable keys and values for self-attention computation in LLMs to address the issue. However, they either fall short in maintaining model quality or result in high serving latency. Drawing inspiration from advanced embedding retrieval techniques prevalent in the data management community, we consider the storage and retrieval of KVCache as a typical embedding retrieval problem. We propose PQCache, which employs Product Quantization (PQ) to manage KVCache, maintaining model quality while ensuring low serving latency. During the prefilling phase, we apply PQ to tokens' keys for each LLM layer and head. During the autoregressive decoding phase, we use PQ codes and centroid
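The abstract's core idea, treating cached keys as an embedding index compressed with Product Quantization, can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the helper names (`kmeans`, `pq_encode`, `pq_topk`) and the parameters `m` (number of sub-spaces) and `k` (centroids per sub-space) are hypothetical, and because the abstract is cut off, the decoding-side scoring shown here is the textbook PQ table lookup for inner products, assumed rather than taken from the paper.

```python
import numpy as np

def kmeans(x, k, iters=10, seed=0):
    """Plain Lloyd's k-means; returns (centroids, assignments)."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        # Squared distance from every point to every centroid.
        d = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = d.argmin(1)
        for j in range(k):
            members = x[assign == j]
            if len(members):  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(0)
    # Final assignment against the last centroid update.
    d = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return centroids, d.argmin(1)

def pq_encode(keys, m, k):
    """Prefill-style step: split each key into m sub-vectors and
    cluster every sub-space independently (illustrative only)."""
    n, dim = keys.shape
    assert dim % m == 0, "head dimension must be divisible by m"
    sub = keys.reshape(n, m, dim // m)
    centroids = np.empty((m, k, dim // m), dtype=keys.dtype)
    codes = np.empty((n, m), dtype=np.int64)
    for i in range(m):
        centroids[i], codes[:, i] = kmeans(sub[:, i, :], k, seed=i)
    return centroids, codes

def pq_topk(query, centroids, codes, topk):
    """Decode-style step: approximate <query, key> for every cached key
    from PQ codes and centroids, then return the top-k token indices."""
    m, k, dsub = centroids.shape
    q_sub = query.reshape(m, dsub)
    # table[i, j] = inner product of the i-th query slice with centroid j.
    table = np.einsum("md,mkd->mk", q_sub, centroids)
    # Sum the looked-up partial products over the m sub-spaces.
    approx = table[np.arange(m), codes].sum(1)
    return np.argsort(-approx)[:topk], approx

# Example: 256 cached keys of one head with dimension 64.
rng = np.random.default_rng(0)
keys = rng.standard_normal((256, 64))
centroids, codes = pq_encode(keys, m=8, k=16)
top_idx, scores = pq_topk(rng.standard_normal(64), centroids, codes, topk=32)
```

In this sketch only the codes (one small integer per sub-space per token) and the centroids are needed to rank tokens; the full keys and values of the selected tokens would be fetched afterwards for exact attention. Where those tensors live and how they are fetched is outside what the visible part of the abstract states.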
Related papers
- HashEvict: A Pre-attention KV Cache Eviction Strategy Using Locality-sensitive Hashing (2024) 0.00
- Neurocache: Efficient Vector Retrieval For Long-range Language Modeling (2024) 1.91
- RetrievalAttention: Accelerating Long-context LLM Inference Via Vector Retrieval (2024) 0.00
- Efficient Context Selection For Long-context QA: No Tuning, No Iteration, Just Adaptive-\(k\) (2025) 2.26
- SLQ: Bridging Modalities Via Shared Latent Queries For Retrieval With Frozen MLLMs (2026) 0.00
- Matching-oriented Product Quantization For Ad-hoc Retrieval (2021) 2.29
- Going Down Memory Lane: Scaling Tokens For Video Stream Understanding With Dynamic KV-cache Memory (2026) 0.00
- Beyond Product Quantization: Deep Progressive Quantization For Image Retrieval (2019) 12.95