Modeling Sequential Sentence Relation To Improve Cross-lingual Dense Retrieval
2023 · Shunyu Zhang, Yaobo Liang, Ming Gong, et al.
Abstract
Recently, multilingual pre-trained language models (PLMs) such as mBERT and XLM-R have made impressive strides in cross-lingual dense retrieval. Despite their successes, they are general-purpose PLMs, while multilingual PLMs tailored for cross-lingual retrieval remain unexplored. Motivated by the observation that sentences in parallel documents appear in approximately the same order, a property that is universal across languages, we propose to model this sequential sentence relation to facilitate cross-lingual representation learning. Specifically, we propose a multilingual PLM called the masked sentence model (MSM), which consists of a sentence encoder that generates sentence representations and a document encoder applied to the sequence of sentence vectors from a document. The document encoder is shared across all languages to model the universal sequential sentence relation. To train the model, we propose a masked sentence prediction task, which masks and predicts the sentence v
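The two-level design described in the abstract (sentence encoder, then a document encoder over sentence vectors, trained by masking a sentence vector) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: `encode_sentence` is a hashed bag-of-words stand-in for a real PLM sentence encoder, `predict_masked` is a distance-weighted average standing in for a learned document encoder, and the softmax cross-entropy over in-document candidates is one plausible objective, not necessarily the loss used in the paper.

```python
import math

DIM = 16  # toy embedding size (assumption, not from the paper)


def encode_sentence(sentence):
    """Hypothetical sentence encoder: L2-normalized hashed bag of words."""
    vec = [0.0] * DIM
    for token in sentence.lower().split():
        # Deterministic bucket per token (avoids Python's randomized str hash).
        vec[sum(ord(ch) for ch in token) % DIM] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def predict_masked(sentence_vectors, masked_index):
    """Hypothetical document encoder: predicts the masked position from the
    other sentence vectors, weighting nearer sentences more heavily.
    The masked vector itself is never read. Needs at least two sentences."""
    context = [(i, v) for i, v in enumerate(sentence_vectors) if i != masked_index]
    weights = [1.0 / abs(i - masked_index) for i, _ in context]
    total = sum(weights)
    return [
        sum(w * v[d] for w, (_, v) in zip(weights, context)) / total
        for d in range(DIM)
    ]


def masked_sentence_loss(predicted, candidates, target_index):
    """Softmax cross-entropy of dot-product scores against candidate vectors
    (an assumed objective for this sketch)."""
    scores = [sum(p * c for p, c in zip(predicted, cand)) for cand in candidates]
    max_score = max(scores)
    log_norm = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
    return log_norm - scores[target_index]


document = [
    "The model reads one sentence at a time.",
    "Each sentence becomes a single vector.",
    "A document encoder then reads the sequence of vectors.",
]
sentence_vectors = [encode_sentence(s) for s in document]
masked_index = 1
predicted = predict_masked(sentence_vectors, masked_index)
loss = masked_sentence_loss(predicted, sentence_vectors, masked_index)
print(f"masked sentence prediction loss: {loss:.4f}")
```

Because the predictor only sees sentence vectors and their order, nothing in it is language-specific, which is the property the abstract relies on when it shares the document encoder across languages.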
Related papers
- Unsupervised Context Aware Sentence Representation Pretraining For Multi-lingual Dense Retrieval (2022) · 3.58
- Massively Multilingual Sentence Embeddings For Zero-shot Cross-lingual Transfer And Beyond (2018) · 26.33
- On Cross-lingual Retrieval With Multilingual Text Encoders (2021) · 10.35
- SLQ: Bridging Modalities Via Shared Latent Queries For Retrieval With Frozen MLLMs (2026) · 0.00
- Investigating Multi-layer Representations For Dense Passage Retrieval (2025) · 0.00
- ColBERT-XM: A Modular Multi-vector Representation Model For Zero-shot Multilingual Information Retrieval (2024) · 0.00
- Transfer Learning Approaches For Building Cross-language Dense Retrieval Models (2022) · 10.97
- SimLM: Pre-training With Representation Bottleneck For Dense Passage Retrieval (2022) · 20.27