ColBERT-XM: A Modular Multi-vector Representation Model For Zero-shot Multilingual Information Retrieval
2024 · Antoine Louis, Vageesh Saxena, Gijs van Dijck, et al.
Abstract
State-of-the-art neural retrievers predominantly focus on high-resource languages like English, which impedes their adoption in retrieval scenarios involving other languages. Current approaches circumvent the lack of high-quality labeled data in non-English languages by leveraging multilingual pretrained language models capable of cross-lingual transfer. However, these models require substantial task-specific fine-tuning across multiple languages, often perform poorly in languages with minimal representation in the pretraining corpus, and struggle to incorporate new languages after the pretraining phase. In this work, we present a novel modular dense retrieval model that learns from the rich data of a single high-resource language and effectively zero-shot transfers to a wide array of languages, thereby eliminating the need for language-specific labeled data. Our model, ColBERT-XM, demonstrates competitive performance against existing state-of-the-art multilingual retrievers trained on
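The abstract does not spell out how a multi-vector model scores a document against a query. As a rough illustration only, the sketch below shows the late-interaction (MaxSim) scoring generally associated with ColBERT-style retrievers: each query token vector is matched to its most similar document token vector, and the per-token maxima are summed. It is assumed here, not confirmed by the text above, that ColBERT-XM scores documents this way. The function names and the toy 2-d embeddings are hypothetical and are not outputs of the model.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs: List[Vector], doc_vecs: List[Vector]) -> float:
    """Late-interaction score: for each query token vector, take its best
    match among the document token vectors, then sum over query tokens."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


def rank(query_vecs: List[Vector], docs: Dict[str, List[Vector]]) -> List[str]:
    """Return document ids sorted by descending MaxSim score."""
    scores = {doc_id: maxsim_score(query_vecs, vecs) for doc_id, vecs in docs.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy, hand-made token embeddings (hypothetical; a real system would
# obtain one vector per token from the encoder).
query = [(1.0, 0.0), (0.0, 1.0)]
docs = {
    "doc_a": [(1.0, 0.0), (0.0, 1.0), (0.5, 0.5)],
    "doc_b": [(0.5, 0.5), (0.0, 0.0)],
}

print(rank(query, docs))
```

Because the score depends only on token-level vectors, the same scoring function applies regardless of which language the query or document is written in; only the encoder that produces the vectors changes.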
Related papers
- Transfer Learning Approaches For Building Cross-language Dense Retrieval Models (2022) 10.97
- Towards Zero-shot Cross-lingual Image Retrieval (2020) 2.46
- Towards Zero-shot Cross-lingual Image Retrieval And Tagging (2021) 2.46
- Boosting Zero-shot Cross-lingual Retrieval By Training On Artificially Code-switched Data (2023) 4.52
- Jina-ColBERT-v2: A General-purpose Multilingual Late Interaction Retriever (2024) 5.24
- Boosting Data Utilization For Multilingual Dense Retrieval (2025) 0.00
- FreeRet: MLLMs As Training-free Retrievers (2025) 0.00
- Tevatron 2.0: Unified Document Retrieval Toolkit Across Scale, Language, And Modality (2025) 3.58