Unsupervised Context Aware Sentence Representation Pretraining For Multi-lingual Dense Retrieval
2022 · Ning Wu, Yaobo Liang, Houxing Ren, et al.
Abstract
Recent research demonstrates the effectiveness of using pretrained language models (PLMs) to improve dense retrieval and multilingual dense retrieval. In this work, we present a simple but effective monolingual pretraining task called contrastive context prediction (CCP), which learns sentence representations by modeling sentence-level contextual relations. By pulling the embeddings of sentences in a local context closer together and pushing random negative samples away, different languages can form isomorphic structures, so that sentence pairs in two different languages are automatically aligned. Our experiments show that model collapse and information leakage occur very easily during contrastive training of language models, and that a language-specific memory bank and an asymmetric batch normalization operation play essential roles in preventing collapse and information leakage, respectively. In addition, post-processing of the sentence embeddings is also very effective for achieving better retrieval performance.
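The core idea of pulling a sentence toward its local context and away from random negatives can be illustrated with a generic InfoNCE-style loss. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the temperature value, and the toy 2-D embeddings are all hypothetical, and it leaves out the language-specific memory bank, the asymmetric batch normalization, and the embedding post-processing mentioned in the abstract.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def context_contrastive_loss(anchor, context, negatives, temperature=0.1):
    """Generic InfoNCE-style loss (assumed form, not taken from the paper).

    The loss is small when `anchor` is more similar to `context`
    (a sentence from the same local window) than to any of `negatives`.
    """
    pos = cosine(anchor, context) / temperature
    logits = [pos] + [cosine(anchor, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp over the positive and all negatives.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - pos

# Toy embeddings, made up purely for illustration.
anchor = [1.0, 0.0]
near_context = [0.9, 0.1]    # points almost the same way as the anchor
far_context = [0.0, 1.0]     # orthogonal to the anchor
negatives = [[-1.0, 0.0], [0.0, -1.0]]

loss_near = context_contrastive_loss(anchor, near_context, negatives)
loss_far = context_contrastive_loss(anchor, far_context, negatives)
print(loss_near < loss_far)
```

In this toy setup the loss is lower when the context sentence lies close to the anchor, which is the training signal that moves neighboring sentences together in embedding space.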
Related papers
- Modeling Sequential Sentence Relation To Improve Cross-lingual Dense Retrieval (2023)
- Query-as-context Pre-training For Dense Passage Retrieval (2022)
- CLASP: Contrastive Language-speech Pretraining For Multilingual Multimodal Information Retrieval (2024)
- Diffusion-pretrained Dense And Contextual Embeddings (2026)
- Compressing Then Matching: An Efficient Pre-training Paradigm For Multimodal Embedding (2025)
- Context-adaptive Multi-prompt Embedding With Large Language Models For Vision-language Alignment (2025)
- In-context Pretraining: Language Modeling Beyond Document Boundaries (2023)
- CSPLADE: Learned Sparse Retrieval With Causal Language Models (2025)