Pre-training Tasks For Embedding-based Large-scale Retrieval
2020 · Wei-Cheng Chang, Felix X. Yu, Yin-Wen Chang, et al.
Abstract
We consider the large-scale query-document retrieval problem: given a query (e.g., a question), return the set of relevant documents (e.g., paragraphs containing the answer) from a large document corpus. This problem is often solved in two steps. The retrieval phase first reduces the solution space, returning a subset of candidate documents. The scoring phase then re-ranks the documents. Critically, the retrieval algorithm must not only achieve high recall but also be highly efficient, returning candidates in time sublinear in the number of documents. Unlike the scoring phase, which has seen significant recent advances due to BERT-style pre-training tasks on cross-attention models, the retrieval phase remains less well studied. Most previous works rely on classic Information Retrieval (IR) methods such as BM-25 (token matching + TF-IDF weights). These models only accept sparse handcrafted features and cannot be optimized for different downstream tasks of interest. In this pape
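The classic baseline named in the abstract, BM-25, can be illustrated with a minimal sketch of the retrieval phase. This is not the paper's code: the function names `bm25_scores` and `retrieve`, the whitespace tokenizer, the parameter defaults `k1=1.5` and `b=0.75`, and the toy corpus are all assumptions chosen for illustration. The sketch also scans every document, so it is linear in corpus size; a production system would typically use an inverted index to approach the sublinear behaviour the abstract calls for.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score every document against the query with a common BM25 variant.

    k1 and b are conventional defaults, assumed here rather than taken
    from the paper.
    """
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for doc in docs_tokens:
        df.update(set(doc))

    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue  # pure token matching: no overlap, no credit
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


def retrieve(query, docs, top_k=2):
    """Retrieval phase: return indices of the top_k candidate documents."""
    query_tokens = query.lower().split()
    docs_tokens = [d.lower().split() for d in docs]
    scores = bm25_scores(query_tokens, docs_tokens)
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return ranked[:top_k]


# Hypothetical toy corpus for illustration only.
docs = [
    "the cat sat on the mat",
    "dogs chase cats in the park",
    "stock markets fell sharply today",
]
candidates = retrieve("cat mat", docs, top_k=2)
```

Here the first document ranks highest because it is the only one sharing exact tokens with the query; "cats" in the second document does not match "cat". That exact-match limitation is one reason sparse handcrafted features are hard to adapt to a downstream task. In a two-step pipeline, the returned candidates would then be passed to a separate scoring model for re-ranking.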
Related papers
- Domain-matched Pre-training Tasks For Dense Retrieval (2021) · 5.24
- Progressively Optimized Bi-granular Document Representation For Scalable Embedding Based Retrieval (2022) · 11.06
- Improving BERT-based Query-by-document Retrieval With Multi-task Optimization (2022) · 9.92
- Diagnosing BERT With Retrieval Heuristics (2022) · 10.21
- Unifier: A Unified Retriever For Large-scale Retrieval (2022) · 7.50
- Large Reasoning Embedding Models: Towards Next-generation Dense Retrieval Paradigm (2025) · 0.00
- LLM-augmented Retrieval: Enhancing Retrieval Models Through Language Models And Doc-level Embedding (2024) · 0.00
- Pre-training For Ad-hoc Retrieval: Hyperlink Is Also You Need (2021) · 10.35