Precise Zero-shot Dense Retrieval Without Relevance Labels
2022 · Luyu Gao, Xueguang Ma, Jimmy Lin, et al.
Abstract
While dense retrieval has been shown to be effective and efficient across tasks and languages, it remains difficult to create effective fully zero-shot dense retrieval systems when no relevance label is available. In this paper, we recognize the difficulty of zero-shot learning and encoding relevance. Instead, we propose to pivot through Hypothetical Document Embeddings (HyDE). Given a query, HyDE first zero-shot instructs an instruction-following language model (e.g. InstructGPT) to generate a hypothetical document. The document captures relevance patterns but is unreal and may contain false details. Then, an unsupervised contrastively learned encoder (e.g. Contriever) encodes the document into an embedding vector. This vector identifies a neighborhood in the corpus embedding space, where similar real documents are retrieved based on vector similarity. This second step grounds the generated document to the actual corpus, with the encoder's dense bottleneck filtering out the incorrect details.
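The two steps described in the abstract (generate a hypothetical document, then embed it and search the corpus by vector similarity) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `toy_encode` is a bag-of-words stand-in for a contrastive encoder such as Contriever, `fake_generate` returns a canned string in place of an instruction-following language model such as InstructGPT, and `hyde_search`, the query, and the three-document corpus are all hypothetical names and data introduced here.

```python
import math
import re
from collections import Counter
from typing import Callable, Dict, List, Tuple

Vector = Dict[str, float]


def toy_encode(text: str) -> Vector:
    """Stand-in encoder: L2-normalized bag-of-words counts (NOT Contriever)."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    if norm == 0:
        return {}
    return {tok: c / norm for tok, c in counts.items()}


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two sparse vectors."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(tok, 0.0) for tok, w in a.items())


def hyde_search(
    query: str,
    corpus: List[str],
    generate: Callable[[str], str],
    encode: Callable[[str], Vector] = toy_encode,
    k: int = 2,
) -> List[Tuple[float, str]]:
    """Retrieve the k real documents closest to a generated hypothetical one."""
    # Step 1: a language model writes a hypothetical answer document.
    hypothetical = generate(query)
    # Step 2: embed the hypothetical document instead of the raw query,
    # then rank real corpus documents by similarity to that embedding.
    query_vec = encode(hypothetical)
    scored = [(dot(query_vec, encode(doc)), doc) for doc in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def fake_generate(query: str) -> str:
    # Assumption: canned text standing in for a real language model call.
    # Like a real generation, it may contain details that are not true.
    return "Wisdom teeth usually take about two weeks to heal after removal surgery."


corpus = [
    "Recovery after wisdom tooth removal surgery can take up to two weeks.",
    "The Eiffel Tower is located in Paris, France.",
    "Dense retrieval encodes queries and documents as vectors.",
]

results = hyde_search(
    "how long does it take to remove wisdom tooth", corpus, fake_generate, k=1
)
for score, doc in results:
    print(f"{score:.3f}  {doc}")
```

Only real corpus documents are ever returned; the generated text serves purely as a search key. In a realistic setup, the corpus embeddings would be precomputed and stored in a vector index rather than re-encoded on every query.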
Related papers
- Hypencoder: Hypernetworks For Information Retrieval (2025) 4.52
- Injecting Domain Adaptation With Learning-to-hash For Effective And Efficient Zero-shot Dense Retrieval (2022) 2.80
- A Representation Sharpening Framework For Zero Shot Dense Retrieval (2025) 0.00
- Pseudo-relevance Feedback For Multiple Representation Dense Retrieval (2021) 12.93
- Selecting Which Dense Retriever To Use For Zero-shot Search (2023) 6.34
- Embedding-based Zero-shot Retrieval Through Query Generation (2020) 0.00
- Improving Query Representations For Dense Retrieval With Pseudo Relevance Feedback (2021) 12.10
- Hierarchical Corpus Encoder: Fusing Generative Retrieval And Dense Indices (2025) 0.00