Sparton: Fast And Memory-efficient Triton Kernel For Learned Sparse Retrieval
2026 · Thong Nguyen, Cosimo Rulli, Franco Maria Nardini, et al.
Abstract
State-of-the-art Learned Sparse Retrieval (LSR) models, such as Splade, typically employ a Language Modeling (LM) head to project latent hidden states into a lexically-anchored logit matrix. This intermediate matrix is subsequently transformed into a sparse lexical representation through element-wise operations (ReLU, Log1P) and max-pooling over the sequence dimension. Despite its effectiveness, the LM head is costly because of the sheer size of the vocabulary (V), which can range from 30,000 to over 250,000 tokens in recent models. Materializing this matrix creates a significant memory bottleneck, limiting model scaling. The resulting I/O overhead between operators further reduces throughput. In this paper, we propose Sparton, a fast, memory-efficient Triton kernel tailored for the LM head in LSR models. Sparton uses a fused approach that integrates the tiled matrix multiplication, ReLU, Log1P, and max-reduction into a single GPU kern
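The computation described in the abstract (LM-head projection, ReLU, Log1P, then max-pooling over the sequence) can be illustrated with a minimal pure-Python sketch. This is not the Sparton kernel and not Triton code: the function names `splade_pool_naive` and `splade_pool_tiled`, the `tile` parameter, and the toy inputs are all hypothetical, chosen only to show why the full sequence-by-vocabulary matrix does not have to be materialized when the steps are fused.

```python
import math

def splade_pool_naive(hidden, lm_head):
    """Unfused baseline: build the full [seq_len x vocab] matrix, then pool."""
    # hidden: seq_len rows of size d; lm_head: vocab rows of size d (assumed layout)
    logits = [[sum(h * w for h, w in zip(h_row, w_row)) for w_row in lm_head]
              for h_row in hidden]
    # Element-wise ReLU followed by log(1 + x)
    activated = [[math.log1p(max(x, 0.0)) for x in row] for row in logits]
    # Max-pool over the sequence dimension, one value per vocabulary term
    return [max(row[v] for row in activated) for v in range(len(lm_head))]

def splade_pool_tiled(hidden, lm_head, tile=2):
    """Fused sketch: walk the vocabulary in tiles, keep only a running max."""
    vocab = len(lm_head)
    # log1p(relu(x)) is never negative, so 0.0 is a safe starting value for the max
    out = [0.0] * vocab
    for start in range(0, vocab, tile):
        for v in range(start, min(start + tile, vocab)):
            w_row = lm_head[v]
            for h_row in hidden:
                logit = sum(h * w for h, w in zip(h_row, w_row))
                out[v] = max(out[v], math.log1p(max(logit, 0.0)))
    return out

# Toy inputs (hypothetical): 2 tokens, hidden size 2, vocabulary of 3 terms
hidden = [[1.0, 0.0], [0.0, 2.0]]
lm_head = [[1.0, 1.0], [-1.0, 0.0], [0.0, 0.5]]
naive = splade_pool_naive(hidden, lm_head)
tiled = splade_pool_tiled(hidden, lm_head, tile=2)
```

Both functions return one weight per vocabulary term. The tiled version only ever holds the output vector and a single logit at a time, which is the memory argument the abstract makes for fusing the operators; an actual GPU kernel would process tiles of the matrix multiplication in parallel rather than in Python loops.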
Related papers
- Mistral-SPLADE: LLMs For Better Learned Sparse Retrieval (2024)
- CSPLADE: Learned Sparse Retrieval With Causal Language Models (2025)
- Sparse And Dense Retrievers Learn Better Together: Joint Sparse-dense Optimization For Text-image Retrieval (2025)
- On The Challenges And Opportunities Of Learned Sparse Retrieval For Code (2026)
- RetrievalAttention: Accelerating Long-context LLM Inference Via Vector Retrieval (2024)
- LightRetriever: A LLM-based Text Retrieval Architecture With Extremely Faster Query Inference (2025)
- The Role Of Vocabularies In Learning Sparse Representations For Ranking (2025)
- Learning Retrieval Models With Sparse Autoencoders (2026)