Boosting Data Utilization For Multilingual Dense Retrieval
2025 · Chao Huang, Fengran Mo, Yufeng Chen, et al.
Abstract
Multilingual dense retrieval aims to retrieve relevant documents across different languages with a single, unified retriever model. The challenge lies in aligning the representations of different languages in a shared vector space. The common practice is to fine-tune the dense retriever via contrastive learning, whose effectiveness depends heavily on the quality of the negative samples and on how effective each mini-batch of training data is. Unlike existing studies that focus on developing sophisticated model architectures, we propose a method that boosts data utilization for multilingual dense retrieval by obtaining high-quality hard negative samples and effective mini-batch data. Extensive experiments on MIRACL, a multilingual retrieval benchmark covering 16 languages, demonstrate the effectiveness of our method, which outperforms several strong existing baselines.
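To make the role of negatives and mini-batch composition concrete, here is a minimal sketch of a generic InfoNCE-style contrastive loss that scores each query against in-batch negatives and one hard negative per query. It is not the authors' method, which the abstract does not specify; the function name `contrastive_loss`, the temperature value, and the one-hard-negative-per-query layout are all assumptions made for illustration.

```python
import numpy as np


def contrastive_loss(q, pos, hard_neg, temperature=0.05):
    """Generic InfoNCE loss (illustrative, not the paper's exact objective).

    q, pos, hard_neg: arrays of shape (B, d) holding query, positive-passage,
    and hard-negative-passage embeddings. One hard negative per query is an
    assumption of this sketch.
    """
    # Candidate pool shared by every query: B positives, then B hard negatives.
    candidates = np.concatenate([pos, hard_neg], axis=0)  # (2B, d)

    # Similarity of every query to every candidate, scaled by temperature.
    logits = q @ candidates.T / temperature  # (B, 2B)

    # Numerically stable log-softmax over the candidate axis.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # The positive for query i sits in column i; every other column is a
    # negative (other queries' positives and all hard negatives).
    idx = np.arange(q.shape[0])
    return float(-log_probs[idx, idx].mean())
```

In this formulation every other passage in the mini-batch acts as a negative for a given query, so both the choice of hard negatives and the way examples are grouped into batches change the loss the model is trained on.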
Related papers
- What Drives Cross-lingual Ranking? Retrieval Approaches With Multilingual Language Models (2025)
- Unsupervised Dense Information Retrieval With Contrastive Learning (2021)
- Soft Prompt Decoding For Multilingual Dense Retrieval (2023)
- Boosting Zero-shot Cross-lingual Retrieval By Training On Artificially Code-switched Data (2023)
- ColBERT-XM: A Modular Multi-vector Representation Model For Zero-shot Multilingual Information Retrieval (2024)
- LLM-augmented Retrieval: Enhancing Retrieval Models Through Language Models And Doc-level Embedding (2024)
- Empowering Dual-encoder With Query Generator For Cross-lingual Dense Retrieval (2023)
- FreeRet: MLLMs As Training-free Retrievers (2025)