Abstract

Current dense retrievers (DRs) are limited in their ability to effectively process misspelled queries, which constitute a significant portion of query traffic in commercial search engines. The main issue is that the pre-trained language model-based encoders used by DRs are typically trained and fine-tuned using clean, well-curated text data. Misspelled queries are typically not found in the data used for training these models, and thus misspelled queries observed at inference time are out-of-distribution compared to the data used for training and fine-tuning. Previous efforts to address this issue have focused on \textit\{fine-tuning\} strategies, but their effectiveness on misspelled queries remains lower than that of pipelines that employ separate state-of-the-art spell-checking components. To address this challenge, we propose ToRoDer (TypOs-aware bottlenecked pre-training for RObust DEnse Retrieval), a novel re-training strategy for DRs that increases their robustness to misspelled

Authors

(none)

Tags

  • Image Retrieval

Stats

  • citations5
  • S2 citationsβ€”
  • github stars0
  • HF likes0
  • heat score5.84
  • arxiv keyzhuang2023typos

Related papers