HashEvict: A Pre-attention KV Cache Eviction Strategy Using Locality-sensitive Hashing
2024 · Minghui Liu, Tahseen Rabbani, Tony O'Halloran, et al.
Abstract
Transformer-based large language models (LLMs) use the key-value (KV) cache to significantly accelerate inference by storing the key and value embeddings of past tokens. However, this cache consumes significant GPU memory. In this work, we introduce HashEvict, an algorithm that uses locality-sensitive hashing (LSH) to compress the KV cache. HashEvict quickly locates tokens in the cache that are cosine dissimilar to the current query token. This is achieved by computing the Hamming distance between binarized Gaussian projections of the current token query and cached token keys, with a projection length much smaller than the embedding dimension. We maintain a lightweight binary structure in GPU memory to facilitate these calculations. Unlike existing compression strategies that compute attention to determine token retention, HashEvict makes these decisions pre-attention, thereby reducing computational costs. Additionally, HashEvict is dynamic - at every decoding step, the key and value o
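The hashing step described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the dimensions, and the rule of picking the single cached key with the largest Hamming distance are all assumptions made for illustration. It only shows how sign bits of a Gaussian projection let Hamming distance stand in for cosine dissimilarity without computing attention.

```python
import numpy as np

def lsh_signature(x, projection):
    """Binarize a Gaussian projection: one sign bit per projected coordinate."""
    return (x @ projection) > 0

def hamming_distance(sig_a, sig_b):
    """Count differing bits along the last axis (broadcasts over cached keys)."""
    return np.count_nonzero(sig_a != sig_b, axis=-1)

def pick_eviction_slot(query, key_signatures, projection):
    """Return the index of the cached key whose signature is farthest from the query's.

    Assumption: the eviction candidate is the single max-distance slot.
    """
    q_sig = lsh_signature(query, projection)
    dists = hamming_distance(key_signatures, q_sig)
    return int(np.argmax(dists))

# Hypothetical sizes: projection length (n_bits) much smaller than embedding dim.
rng = np.random.default_rng(0)
dim, n_bits, cache_size = 64, 16, 8

projection = rng.standard_normal((dim, n_bits))   # shared Gaussian projection
keys = rng.standard_normal((cache_size, dim))     # stand-in for cached key embeddings
query = rng.standard_normal(dim)                  # stand-in for the current query

# Make slot 3 point exactly opposite to the query so every sign bit flips.
keys[3] = -query

# Only these bit signatures need to be kept for the distance computation.
key_signatures = lsh_signature(keys, projection)

slot = pick_eviction_slot(query, key_signatures, projection)
print("evict slot:", slot)
```

Because slot 3 holds the negated query, all `n_bits` sign bits differ and its Hamming distance reaches the maximum possible value, which is what makes it the eviction candidate in this toy setup.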
Related papers
- PQCache: Product Quantization-based KVCache For Long Context LLM Inference (2024) 7.50
- RetrievalAttention: Accelerating Long-context LLM Inference Via Vector Retrieval (2024) 0.00
- Neurocache: Efficient Vector Retrieval For Long-range Language Modeling (2024) 1.91
- Distilling Vision-language Pretraining For Efficient Cross-modal Retrieval (2024) 0.00
- TransHash: Transformer-based Hamming Hashing For Efficient Image Retrieval (2021) 13.44
- PromptHash: Affinity-prompted Collaborative Cross-modal Learning For Adaptive Hashing Retrieval (2025) 7.70
- Going Down Memory Lane: Scaling Tokens For Video Stream Understanding With Dynamic KV-cache Memory (2026) 0.00
- VQToken: Neural Discrete Token Representation Learning For Extreme Token Reduction In Video Large Language Models (2025) 0.00