Boosting Zero-shot Cross-lingual Retrieval By Training On Artificially Code-switched Data
2023 · Robert Litschko, Ekaterina Artemova, Barbara Plank
Abstract
Transferring information retrieval (IR) models from a high-resource language (typically English) to other languages in a zero-shot fashion has become a widely adopted approach. In this work, we show that the effectiveness of zero-shot rankers diminishes when queries and documents are present in different languages. Motivated by this, we propose to train ranking models on artificially code-switched data instead, which we generate by utilizing bilingual lexicons. To this end, we experiment with lexicons induced from (1) cross-lingual word embeddings and (2) parallel Wikipedia page titles. We use the mMARCO dataset to extensively evaluate reranking models on 36 language pairs spanning Monolingual IR (MoIR), Cross-lingual IR (CLIR), and Multilingual IR (MLIR). Our results show that code-switching can yield consistent and substantial gains of 5.1 MRR@10 in CLIR and 3.9 MRR@10 in MLIR, while maintaining stable performance in MoIR. Encouragingly, the gains are especially pronounced for distan
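The lexicon-based code-switching idea described in the abstract can be illustrated with a minimal sketch. It assumes whitespace tokenization, a per-token replacement probability, and a toy English-to-German lexicon; the function name `code_switch`, the `prob` parameter, and the example lexicon are hypothetical and are not taken from the paper, whose exact procedure is not given in this excerpt.

```python
import random


def code_switch(text, lexicon, prob=0.5, seed=None):
    """Replace tokens with a translation from a bilingual lexicon.

    Assumptions (not from the paper): tokens are split on whitespace,
    lookups are lowercased, and each token that has an entry is replaced
    independently with probability `prob`.
    """
    rng = random.Random(seed)
    switched = []
    for token in text.split():
        translations = lexicon.get(token.lower())
        if translations and rng.random() < prob:
            # Pick one of the candidate translations at random.
            switched.append(rng.choice(translations))
        else:
            # Keep tokens that have no entry or were not sampled.
            switched.append(token)
    return " ".join(switched)


# Toy English-to-German lexicon, for illustration only.
toy_lexicon = {
    "what": ["was"],
    "is": ["ist"],
    "the": ["die"],
    "capital": ["Hauptstadt"],
}

query = "what is the capital of France"
fully_switched = code_switch(query, toy_lexicon, prob=1.0)
unchanged = code_switch(query, toy_lexicon, prob=0.0)
```

With `prob=1.0` every token that has a lexicon entry is replaced, and with `prob=0.0` the text is returned unchanged; intermediate values give mixed-language text of the kind a ranker could be trained on.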
Related papers
- What Drives Cross-lingual Ranking? Retrieval Approaches With Multilingual Language Models (2025)
- Transfer Learning Approaches For Building Cross-language Dense Retrieval Models (2022)
- Towards Zero-shot Cross-lingual Image Retrieval (2020)
- Towards Zero-shot Cross-lingual Image Retrieval And Tagging (2021)
- On Cross-lingual Retrieval With Multilingual Text Encoders (2021)
- Improving Cross-lingual Information Retrieval On Low-resource Languages Via Optimal Transport Distillation (2023)
- ColBERT-XM: A Modular Multi-vector Representation Model For Zero-shot Multilingual Information Retrieval (2024)
- Parameter-efficient Neural Reranking For Cross-lingual And Multilingual Retrieval (2022)