RetrievalAttention: Accelerating Long-context LLM Inference Via Vector Retrieval
2024 · Di Liu, Meng Chen, Baotong Lu, et al.
Abstract
Transformer-based Large Language Models (LLMs) have become increasingly important. However, because attention computation has quadratic time complexity, scaling LLMs to longer contexts incurs extremely slow inference and high GPU memory consumption for caching key-value (KV) vectors. This paper proposes RetrievalAttention, a training-free approach that both accelerates attention computation and reduces GPU memory consumption. Leveraging the dynamic sparsity of the attention mechanism, RetrievalAttention builds approximate nearest neighbor search (ANNS) indexes for KV vectors in CPU memory and retrieves the most relevant ones through vector search during generation. Unfortunately, we observe that off-the-shelf ANNS indexes are often ineffective for such retrieval tasks because the query vectors are out-of-distribution (OOD) with respect to the key vectors in the attention mechanism. RetrievalAttention addresses the OOD challenge by designing an attention-aware vector search algorithm…
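The retrieval idea in the abstract can be illustrated with a small sketch: for each query, fetch only the keys with the largest inner products and run softmax attention over that subset. This is not the paper's method. It assumes an exact brute-force top-k search as a stand-in for an ANNS index, it does not model the attention-aware index or the CPU/GPU split, and all function names (`retrieve_top_k`, `sparse_attention`, and so on) are hypothetical.

```python
import math
import random


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, keys, values):
    """Single-query scaled dot-product attention over the given keys/values."""
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, k) * scale for k in keys])
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


def retrieve_top_k(query, keys, k):
    """Assumption: exact search standing in for an ANNS index.

    Returns the indices of the k keys with the largest inner product
    with the query, best match first.
    """
    ranked = sorted(range(len(keys)), key=lambda i: dot(query, keys[i]), reverse=True)
    return ranked[:k]


def sparse_attention(query, keys, values, k):
    """Attend only to the k keys retrieved for this query."""
    idx = retrieve_top_k(query, keys, k)
    return attention(query, [keys[i] for i in idx], [values[i] for i in idx])


if __name__ == "__main__":
    random.seed(0)
    dim, n = 16, 512
    keys = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
    values = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
    # Toy query aligned with one key, so most attention mass sits on few keys.
    query = [4.0 * x for x in keys[7]]

    full = attention(query, keys, values)
    approx = sparse_attention(query, keys, values, k=16)
    max_err = max(abs(a - b) for a, b in zip(full, approx))
    print(f"max abs error using 16 of {n} keys: {max_err:.2e}")
```

When attention is sparse, the keys outside the retrieved set carry little softmax mass, so the subset output stays close to full attention. A real system would replace `retrieve_top_k` with an approximate index, which is where the query/key distribution mismatch noted in the abstract becomes relevant.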
Related papers
- Neurocache: Efficient Vector Retrieval For Long-range Language Modeling (2024)
- HashEvict: A Pre-attention KV Cache Eviction Strategy Using Locality-sensitive Hashing (2024)
- LightRetriever: A LLM-based Text Retrieval Architecture With Extremely Faster Query Inference (2025)
- PQCache: Product Quantization-based KVCache For Long Context LLM Inference (2024)
- You Only Use Reactive Attention Slice For Long Context Retrieval (2024)
- Making Large Language Models Efficient Dense Retrievers (2025)
- ScalingNote: Scaling Up Retrievers With Large Language Models For Real-world Dense Retrieval (2024)
- Context-enhanced Video Moment Retrieval With Large Language Models (2024)