To Case Or Not To Case: An Empirical Study In Learned Sparse Retrieval
2026 · Emmanouil Georgios Lionis, Jia-Huei Ju, Angelos Nalmpantis, et al.
Abstract
Learned Sparse Retrieval (LSR) methods construct sparse lexical representations of queries and documents that can be efficiently searched using inverted indexes. Existing LSR approaches have relied almost exclusively on uncased backbone models, whose vocabularies exclude case-sensitive distinctions, thereby reducing vocabulary mismatch. However, the most recent state-of-the-art language models are only available in cased versions. Despite this shift, the impact of backbone model casing on LSR has not been studied, potentially posing a risk to the viability of the method going forward. To fill this gap, we systematically evaluate paired cased and uncased versions of the same backbone models across multiple datasets to assess their suitability for LSR. Our findings show that LSR models with cased backbone models by default perform substantially worse than their uncased counterparts; however, this gap can be eliminated by pre-processing the text to lowercase. Moreover, our token-level ana
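The vocabulary-mismatch effect of casing, and the lowercasing pre-processing step the abstract describes, can be illustrated with a toy sketch. This is not the paper's pipeline: the whitespace tokenizer, raw-count term weights, and the names `sparse_rep` and `overlap_score` are all assumptions made for illustration, standing in for a learned sparse encoder with a cased vocabulary.

```python
from collections import Counter


def sparse_rep(text, lowercase=False):
    """Toy sparse representation: map each whitespace token to its count.

    Hypothetical stand-in for a learned sparse encoder; a case-sensitive
    vocabulary treats "Sparse" and "sparse" as different terms.
    """
    if lowercase:
        text = text.lower()  # the pre-processing step: lowercase before tokenizing
    return Counter(text.split())


def overlap_score(query, doc, lowercase=False):
    """Dot product of the two toy sparse vectors over shared terms."""
    q = sparse_rep(query, lowercase)
    d = sparse_rep(doc, lowercase)
    return sum(weight * d[term] for term, weight in q.items())


query = "learned sparse retrieval"
doc = "Learned Sparse Retrieval builds sparse lexical representations"

# Case-sensitive matching: only the lowercase "sparse" in the document matches.
cased_score = overlap_score(query, doc)

# Lowercasing both sides first lets every query term match.
lowercased_score = overlap_score(query, doc, lowercase=True)

print(cased_score, lowercased_score)
```

In this toy setting the lowercased score is higher only because more terms coincide; it says nothing about the size of the effect in a trained LSR model, which the paper measures empirically.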
Related papers
- CSPLADE: Learned Sparse Retrieval With Causal Language Models (2025)
- Mistral-SPLADE: LLMs For Better Learned Sparse Retrieval (2024)
- The Role Of Vocabularies In Learning Sparse Representations For Ranking (2025)
- Learning Retrieval Models With Sparse Autoencoders (2026)
- Adapting Learned Sparse Retrieval For Long Documents (2023)
- On The Challenges And Opportunities Of Learned Sparse Retrieval For Code (2026)
- Scaling Sparse And Dense Retrieval In Decoder-only LLMs (2025)
- On The Reproducibility Of Learned Sparse Retrieval Adaptations For Long Documents (2025)