Semi-parametric Retrieval Via Binary Bag-of-tokens Index
2024 · Jiawei Zhou, Li Dong, Furu Wei, et al.
Abstract
Information retrieval has transitioned from standalone systems into essential components across broader applications, with indexing efficiency, cost-effectiveness, and freshness becoming increasingly critical yet often overlooked. In this paper, we introduce SemI-parametric Disentangled Retrieval (SiDR), a bi-encoder retrieval framework that decouples retrieval index from neural parameters to enable efficient, low-cost, and parameter-agnostic indexing for emerging use cases. Specifically, in addition to using embeddings as indexes like existing neural retrieval methods, SiDR supports a non-parametric tokenization index for search, achieving BM25-like indexing complexity with significantly better effectiveness. Our comprehensive evaluation across 16 retrieval benchmarks demonstrates that SiDR outperforms both neural and term-based retrieval baselines under the same indexing workload: (i) When using an embedding-based index, SiDR exceeds the performance of conventional neural retrievers
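To make the idea of a non-parametric tokenization index concrete, here is a minimal sketch of a binary bag-of-tokens index. It is an illustration under stated assumptions, not the authors' implementation: the whitespace `tokenize` helper, the function names, and the hand-written query weights are all hypothetical stand-ins (a real system would presumably use a subword tokenizer and a learned query encoder).

```python
from collections import defaultdict


def tokenize(text):
    # Hypothetical whitespace tokenizer; stands in for a real subword tokenizer.
    return text.lower().split()


def build_index(docs):
    # Binary bag-of-tokens: a document is represented only by WHICH tokens it
    # contains (presence/absence), so indexing needs tokenization and no
    # neural forward pass.
    inverted = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for tok in set(tokenize(text)):
            inverted[tok].add(doc_id)
    return inverted


def search(inverted, query_weights, top_k=3):
    # Score = dot product between a weighted query vector over the vocabulary
    # and each document's binary vector. The weights here are assumed to come
    # from some query encoder; in this sketch they are supplied by hand.
    scores = defaultdict(float)
    for tok, weight in query_weights.items():
        for doc_id in inverted.get(tok, ()):
            scores[doc_id] += weight
    # Highest score first; ties broken by document id for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
```

Because documents are stored as token sets, adding or refreshing a document touches only its own postings, which is the kind of indexing cost profile the abstract compares to BM25.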
Related papers
- SEINE: Segment-based Indexing For Neural Information Retrieval (2023) 0.00
- Semantic-enhanced Differentiable Search Index Inspired By Learning Strategies (2023) 10.35
- VectorSearch: Enhancing Document Retrieval With Semantic Embeddings And Optimized Search (2024) 0.00
- Ultra-high Dimensional Sparse Representations With Binarization For Efficient Text Retrieval (2021) 8.60
- From HNSW To Information-theoretic Binarization: Rethinking The Architecture Of Scalable Vector Search (2025) 0.00
- Evaluating Embedding APIs For Information Retrieval (2023) 8.09
- EHI: End-to-end Learning Of Hierarchical Index For Efficient Dense Retrieval (2023) 0.00
- Efficient Neural Ranking Using Forward Indexes And Lightweight Encoders (2023) 5.24