Pre-training For Ad-hoc Retrieval: Hyperlink Is Also You Need
2021 · Zhengyi Ma, Zhicheng Dou, Wei Xu, et al.
Abstract
Designing pre-training objectives that more closely resemble the downstream tasks can improve the fine-tuning performance of pre-trained language models, especially in ad-hoc retrieval. Existing pre-training approaches tailored for IR incorporate weakly supervised signals, such as query-likelihood-based sampling, to construct pseudo query-document pairs from the raw textual corpus. However, these signals rely heavily on the sampling method. For example, the query likelihood model may introduce considerable noise into the constructed pre-training data. In this paper, we propose to leverage large-scale hyperlinks and anchor texts to pre-train the language model for ad-hoc retrieval. Since anchor texts are created by webmasters and can usually summarize the target document, they can help to build more accurate and reliable pre-training samples than a specific algorithm. Considering dif
† This work was done during an internship at Huawei.
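The idea of treating anchor text as a pseudo query for the linked document can be illustrated with a minimal sketch. This is not the authors' pipeline: the names `AnchorPairExtractor` and `extract_anchor_pairs` and the example URL are hypothetical, and the sketch assumes that each non-empty anchor text paired with its resolved link target is a usable (pseudo query, document URL) sample. It uses only Python's standard `html.parser` and `urllib.parse` modules.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorPairExtractor(HTMLParser):
    """Hypothetical helper: collect (anchor_text, target_url) pairs from one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.pairs = []      # collected (anchor_text, target_url) tuples
        self._href = None    # target of the <a> tag currently open, if any
        self._text = []      # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Normalize whitespace; skip anchors with no visible text.
            text = " ".join("".join(self._text).split())
            if text:
                self.pairs.append((text, self._href))
            self._href = None


def extract_anchor_pairs(html, base_url):
    """Return pseudo (query, document URL) pairs found in an HTML string."""
    parser = AnchorPairExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.pairs


page = '<p>See <a href="/wiki/BM25">Okapi  BM25</a> and <a href="#">  </a>.</p>'
print(extract_anchor_pairs(page, "https://example.org/page"))
# prints [('Okapi BM25', 'https://example.org/wiki/BM25')]
```

In this sketch the anchor with only whitespace is dropped, since it carries no text that could serve as a query.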
Related papers
- Pre-training For Information Retrieval: Are Hyperlinks Fully Explored? (2022) · 0.00
- C3: Continued Pretraining With Contrastive Weak Supervision For Cross Language Ad-hoc Retrieval (2022) · 8.35
- Pre-training Tasks For Embedding-based Large-scale Retrieval (2020) · 0.00
- Unsupervised Dense Retrieval Training With Web Anchors (2023) · 3.81
- Learning To Retrieve: How To Train A Dense Retrieval Model Effectively And Efficiently (2020) · 0.00
- Pre-training Vs. Fine-tuning: A Reproducibility Study On Dense Retrieval Knowledge Acquisition (2025) · 0.95
- Adversarial Sampling And Training For Semi-supervised Information Retrieval (2018) · 14.43
- Large Language Models Are Built-in Autoregressive Search Engines (2023) · 13.49