Improved Learned Sparse Retrieval With Corpus-specific Vocabularies
2024 · Puxuan Yu, Antonio Mallia, Matthias Petri
Abstract
We explore leveraging corpus-specific vocabularies that improve both the efficiency and effectiveness of learned sparse retrieval systems. We find that pre-training the underlying BERT model on the target corpus, specifically targeting different vocabulary sizes incorporated into the document expansion process, improves retrieval quality by up to 12% while in some scenarios decreasing latency by up to 50%. Our experiments show that adopting a corpus-specific vocabulary and increasing vocabulary size decrease average postings list length, which in turn reduces latency. Ablation studies show interesting interactions between custom vocabularies, document expansion techniques, and sparsification objectives of sparse models. Both effectiveness and efficiency improvements transfer to different retrieval approaches such as uniCOIL and SPLADE, and offer a simple yet effective approach to providing new efficiency-effectiveness trade-offs for learned sparse retrieval systems.
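The link between vocabulary size and average postings list length can be illustrated with a toy inverted index. The sketch below is not the paper's method or data: the three-document corpus, both tokenizations, and the helper names `build_inverted_index` and `average_postings_length` are assumptions made up for illustration. It only shows how the statistic is computed and why a finer-grained vocabulary tends to spread postings over more, shorter lists.

```python
from collections import defaultdict


def build_inverted_index(tokenized_docs):
    """Map each term to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, tokens in enumerate(tokenized_docs):
        for tok in tokens:
            index[tok].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def average_postings_length(index):
    """Total number of postings divided by the number of distinct terms."""
    return sum(len(postings) for postings in index.values()) / len(index)


# Hypothetical tokenization with a small, generic subword vocabulary:
# words are split into shared pieces, so pieces recur across documents.
small_vocab_docs = [
    ["retriev", "##al", "latenc", "##y"],
    ["retriev", "##er", "qual", "##ity"],
    ["spars", "##ity", "latenc", "##y"],
]

# Hypothetical tokenization with a larger vocabulary that keeps
# whole words as single terms.
large_vocab_docs = [
    ["retrieval", "latency"],
    ["retriever", "quality"],
    ["sparsity", "latency"],
]

small_avg = average_postings_length(build_inverted_index(small_vocab_docs))
large_avg = average_postings_length(build_inverted_index(large_vocab_docs))

print(f"small vocabulary: {small_avg:.2f} postings per term")
print(f"large vocabulary: {large_avg:.2f} postings per term")
```

In this toy setting the larger vocabulary yields shorter lists on average because shared subword pieces no longer collect postings from unrelated words. The sketch says nothing about query latency itself, which the abstract reports as an empirical result.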
Related papers
- The Role Of Vocabularies In Learning Sparse Representations For Ranking (2025)
- Scaling Sparse And Dense Retrieval In Decoder-only LLMs (2025)
- CSPLADE: Learned Sparse Retrieval With Causal Language Models (2025)
- Predicting Efficiency/effectiveness Trade-offs For Dense Vs. Sparse Retrieval Strategy Selection (2021)
- On The Challenges And Opportunities Of Learned Sparse Retrieval For Code (2026)
- Early Stage Sparse Retrieval With Entity Linking (2022)
- Sparse And Dense Retrievers Learn Better Together: Joint Sparse-dense Optimization For Text-image Retrieval (2025)
- Exploring \(\ell_0\) Sparsification For Inference-free Sparse Retrievers (2025)