Investigating The Scalability Of Approximate Sparse Retrieval Algorithms To Massive Datasets
2025 · Sebastian Bruch, Franco Maria Nardini, Cosimo Rulli, et al.
Abstract
Learned sparse text embeddings have gained popularity due to their effectiveness in top-k retrieval and inherent interpretability. Their distributional idiosyncrasies, however, have long hindered their use in real-world retrieval systems. That changed with the recent development of approximate algorithms that leverage the distributional properties of sparse embeddings to speed up retrieval. Nonetheless, in much of the existing literature, evaluation has been limited to datasets with only a few million documents such as MSMARCO. It remains unclear how these systems behave on much larger datasets and what challenges lurk at larger scales. To bridge that gap, we investigate the behavior of state-of-the-art retrieval algorithms on massive datasets. We compare and contrast the recently proposed Seismic and graph-based solutions adapted from dense retrieval. We extensively evaluate Splade embeddings of 138M passages from MsMarco-v2 and report indexing time and other efficiency and effectiven
Related papers
- Efficient Inverted Indexes For Approximate Retrieval Over Learned Sparse Representations (2024)
- Pairing Clustered Inverted Indexes With kNN Graphs For Fast Approximate Retrieval Over Learned Sparse Representations (2024)
- Adapting Learned Sparse Retrieval For Long Documents (2023)
- Approximate Cluster-based Sparse Document Retrieval With Segmented Maximum Term Weights (2024)
- Ultra-high Dimensional Sparse Representations With Binarization For Efficient Text Retrieval (2021)
- Scaling Sparse And Dense Retrieval In Decoder-only LLMs (2025)
- Scaling Laws For Embedding Dimension In Information Retrieval (2026)
- On The Challenges And Opportunities Of Learned Sparse Retrieval For Code (2026)