Typos-aware Bottlenecked Pre-training For Robust Dense Retrieval
2023 · Shengyao Zhuang, Linjun Shou, Jian Pei, et al.
Abstract
Current dense retrievers (DRs) are limited in their ability to effectively process misspelled queries, which constitute a significant portion of query traffic in commercial search engines. The main issue is that the pre-trained language model-based encoders used by DRs are typically trained and fine-tuned on clean, well-curated text data. Misspelled queries are rarely found in this data, so misspelled queries observed at inference time are out-of-distribution relative to the data used for training and fine-tuning. Previous efforts to address this issue have focused on fine-tuning strategies, but their effectiveness on misspelled queries remains lower than that of pipelines that employ separate state-of-the-art spell-checking components. To address this challenge, we propose ToRoDer (TypOs-aware bottlenecked pre-training for RObust DEnse Retrieval), a novel pre-training strategy for DRs that increases their robustness to misspelled
Related papers
- Typo-robust Representation Learning For Dense Retrieval (2023) 7.50
- Analysing The Robustness Of Dual Encoders For Dense Retrieval Against Misspellings (2022) 9.59
- Improving The Robustness Of Dense Retrievers Against Typos Via Multi-positive Contrastive Learning (2024) 5.84
- Towards Dynamic Dense Retrieval With Routing Strategy (2026) 0.00
- Learning To Retrieve: How To Train A Dense Retrieval Model Effectively And Efficiently (2020) 0.00
- Bridging The Training-inference Gap For Dense Phrase Retrieval (2022) 2.26
- Efficiently Teaching An Effective Dense Retriever With Balanced Topic Aware Sampling (2021) 17.07
- How To Train Your DRAGON: Diverse Augmentation Towards Generalizable Dense Retrieval (2023) 11.39