The Role Of Vocabularies In Learning Sparse Representations For Ranking
2025 · Hiun Kim, Tae Kwan Lee, Taeryun Won
Abstract
Learned Sparse Retrieval (LSR) methods such as SPLADE have attracted growing interest for effective semantic first-stage matching while retaining the efficiency of inverted indices. Recent work on learning SPLADE models with expanded vocabularies (ESPLADE) proposed representing queries and documents in a sparse space over a custom vocabulary with different levels of vocabulary granularity. However, few studies have examined the role of the vocabulary in SPLADE models and its relationship to retrieval efficiency and effectiveness. To study this, we construct BERT models with 100K-sized output vocabularies, one initialized with the ESPLADE pretraining method and one initialized randomly. After fine-tuning on real-world search click logs, we apply logit score-based pruning of queries and documents to a maximum size to further balance efficiency. Experimental results on our evaluation set show that, when pruning is applied, the two models are effective compared to the 32K-s
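The logit score-based pruning mentioned in the abstract can be illustrated with a minimal sketch. It assumes a sparse representation is a mapping from vocabulary term to a non-negative weight, and that pruning keeps only the highest-weighted terms up to a maximum size. The function names `prune_sparse_vector` and `dot_score`, the example terms, and the chosen maximum size are hypothetical; the abstract does not specify the authors' exact procedure or sizes.

```python
import heapq


def prune_sparse_vector(weights, max_size):
    """Keep at most `max_size` terms with the largest weights.

    `weights` maps a vocabulary term to its score. Terms with a
    non-positive weight are dropped, since they would not be stored
    in an inverted index anyway. (Illustrative assumption, not the
    authors' stated implementation.)
    """
    if max_size < 0:
        raise ValueError("max_size must be non-negative")
    positive = ((term, w) for term, w in weights.items() if w > 0.0)
    kept = heapq.nlargest(max_size, positive, key=lambda tw: tw[1])
    return dict(kept)


def dot_score(query, doc):
    """Sparse dot product over the terms shared by query and document."""
    # Iterate over the smaller vector to limit the number of lookups.
    small, large = (query, doc) if len(query) <= len(doc) else (doc, query)
    return sum(w * large[term] for term, w in small.items() if term in large)


# Hypothetical document and query representations.
doc = {"shoe": 2.1, "sneaker": 1.7, "running": 0.9, "the": 0.05, "red": 0.0}
query = {"running": 1.2, "shoe": 1.5}

pruned_doc = prune_sparse_vector(doc, max_size=3)
score = dot_score(query, pruned_doc)
```

A smaller maximum size shortens posting lists and reduces query-time work, at the risk of discarding low-weight terms that would otherwise have matched; this is the efficiency and effectiveness trade-off the abstract refers to.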
Related papers
- Improved Learned Sparse Retrieval With Corpus-specific Vocabularies (2024) 6.34
- Learning Retrieval Models With Sparse Autoencoders (2026) 0.00
- Mistral-SPLADE: LLMs For Better Learned Sparse Retrieval (2024) 0.00
- On The Challenges And Opportunities Of Learned Sparse Retrieval For Code (2026) 0.00
- CSPLADE: Learned Sparse Retrieval With Causal Language Models (2025) 0.00
- CoSPLADE: Contextualizing SPLADE For Conversational Information Retrieval (2023) 6.77
- Sparse And Dense Retrievers Learn Better Together: Joint Sparse-dense Optimization For Text-image Retrieval (2025) 0.00
- Adapting Learned Sparse Retrieval For Long Documents (2023) 5.24