Scatterbrain: Unifying Sparse And Low-rank Attention Approximation
2021 · Beidi Chen, Tri Dao, Eric Winsor, et al.
Abstract
Recent advances in efficient Transformers have exploited either the sparsity or low-rank properties of attention matrices to reduce the computational and memory bottlenecks of modeling long sequences. However, it is still challenging to balance the trade-off between model quality and efficiency to perform a one-size-fits-all approximation for different tasks. To better understand this trade-off, we observe that sparse and low-rank approximations excel in different regimes, determined by the softmax temperature in attention, and sparse + low-rank can outperform each individually. Inspired by the classical robust-PCA algorithm for sparse and low-rank decomposition, we propose Scatterbrain, a novel way to unify sparse (via locality sensitive hashing) and low-rank (via kernel feature map) attention for accurate and efficient approximation. The estimation is unbiased with provably low error. We empirically show that Scatterbrain can achieve 2.1x lower error than baselines when serving as a
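As a rough illustration of the sparse + low-rank idea described in the abstract, the sketch below approximates the unnormalized attention matrix exp(QK^T) with a kernel feature map and then overwrites a small set of entries with their exact values. It is a minimal sketch under stated assumptions, not the authors' implementation: the helper names are hypothetical, the low-rank part uses positive random features, and the sparse support is chosen by looking at the largest exact entries in each row, which stands in for the locality sensitive hashing step and would not be affordable at scale.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def random_features(x, W):
    """Positive random features phi(x). With rows of W drawn from N(0, I),
    E[phi(q) . phi(k)] = exp(q . k)."""
    half_sq_norm = 0.5 * dot(x, x)
    scale = 1.0 / math.sqrt(len(W))
    return [scale * math.exp(dot(w, x) - half_sq_norm) for w in W]

def sparse_plus_low_rank(Q, K, W, keep_per_row):
    """Return (exact, low_rank, combined) unnormalized attention matrices."""
    phi_q = [random_features(q, W) for q in Q]
    phi_k = [random_features(k, W) for k in K]
    # Exact matrix, built here only as a reference and to pick the support.
    exact = [[math.exp(dot(q, k)) for k in K] for q in Q]
    low_rank = [[dot(fq, fk) for fk in phi_k] for fq in phi_q]
    combined = []
    for exact_row, lr_row in zip(exact, low_rank):
        # Assumption: an oracle picks the largest entries of each row.
        # This replaces the hashing-based selection named in the abstract.
        order = sorted(range(len(exact_row)), key=lambda j: -exact_row[j])
        support = set(order[:keep_per_row])
        # Low-rank estimate everywhere, exact value on the sparse support.
        combined.append([exact_row[j] if j in support else lr_row[j]
                         for j in range(len(exact_row))])
    return exact, low_rank, combined

def frobenius_error(A, B):
    return math.sqrt(sum((a - b) ** 2
                         for row_a, row_b in zip(A, B)
                         for a, b in zip(row_a, row_b)))

rng = random.Random(0)
n, d, m = 16, 4, 8  # sequence length, head dimension, number of random features
Q = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
K = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
W = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(m)]

exact, low_rank, combined = sparse_plus_low_rank(Q, K, W, keep_per_row=2)
err_low_rank = frobenius_error(exact, low_rank)
err_combined = frobenius_error(exact, combined)
print(err_low_rank, err_combined)
```

Because the combined matrix matches the exact one on the chosen support and equals the low-rank estimate elsewhere, its Frobenius error can never exceed that of the low-rank estimate alone. How much it improves depends on how concentrated the attention rows are.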
Related papers
- Treeformer: Dense Gradient Trees For Efficient Attention Computation (2022), 0.00
- DistrAttention: An Efficient And Flexible Self-attention Mechanism On Modern GPUs (2025), 0.00
- EcoFormer: Energy-saving Attention With Linear Complexity (2022), 3.75
- Beyond Matryoshka: Revisiting Sparse Coding For Adaptive Representation (2025), 4.30
- RetrievalAttention: Accelerating Long-context LLM Inference Via Vector Retrieval (2024), 0.00
- A Theoretical View On Sparsely Activated Networks (2022), 0.00
- Sparton: Fast And Memory-efficient Triton Kernel For Learned Sparse Retrieval (2026), 0.00
- Efficient Inverted Indexes For Approximate Retrieval Over Learned Sparse Representations (2024), 11.67