M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation
2024 · Jianlv Chen, Shitao Xiao, Peitian Zhang, et al.
Abstract
In this paper, we introduce a new embedding model called M3-Embedding, which is distinguished by its versatility in Multi-Linguality, Multi-Functionality, and Multi-Granularity. It provides uniform support for the semantic retrieval of more than 100 working languages. It can simultaneously accomplish the three common retrieval functionalities: dense retrieval, multi-vector retrieval, and sparse retrieval. It is also capable of processing inputs of different granularities, spanning from short sentences to long documents of up to 8,192 tokens. The effective training of M3-Embedding presents a series of technical contributions. Notably, we propose a novel self-knowledge distillation approach, where the relevance scores from different retrieval functionalities can be integrated as the teacher signal to enhance the training quality. We also optimize the batching strategy, which enables a large batch size and high training throughput to improve th
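The abstract names three retrieval functionalities and a teacher signal built from their combined relevance scores, but it does not give formulas. The sketch below is one plausible reading, not the authors' implementation. It assumes that dense scoring is an inner product of single vectors, that sparse scoring sums term-weight products over shared terms, that multi-vector scoring uses late interaction (best match per query token, averaged), and that the teacher is a weighted sum of the three scores distilled through a softmax cross-entropy. All function names and the equal default weights are hypothetical.

```python
import math


def dense_score(q_vec, d_vec):
    """Inner product of two single-vector embeddings (assumed dense scoring)."""
    return sum(a * b for a, b in zip(q_vec, d_vec))


def sparse_score(q_weights, d_weights):
    """Sum of weight products over terms present in both query and document."""
    return sum(w * d_weights[t] for t, w in q_weights.items() if t in d_weights)


def multi_vector_score(q_vecs, d_vecs):
    """Late interaction: best-matching document vector per query vector, averaged."""
    best = [max(dense_score(q, d) for d in d_vecs) for q in q_vecs]
    return sum(best) / len(best)


def integrated_scores(dense, sparse, multi, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three score lists (one entry per candidate).

    The weights are a hypothetical choice for illustration only.
    """
    w_d, w_s, w_m = weights
    return [w_d * d + w_s * s + w_m * m for d, s, m in zip(dense, sparse, multi)]


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_scores, student_scores):
    """Cross-entropy of the student distribution against the teacher distribution."""
    p = softmax(teacher_scores)
    q = softmax(student_scores)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))


# Toy scores for two candidate passages, one list per retrieval functionality.
dense = [2.0, 0.5]
sparse = [1.0, 0.2]
multi = [1.5, 0.4]
teacher = integrated_scores(dense, sparse, multi)

# Each functionality is trained as a student against the shared teacher signal.
losses = [distillation_loss(teacher, student) for student in (dense, sparse, multi)]
```

Under these assumptions the teacher ranks candidates using evidence from all three functionalities, so a student that disagrees with the combined ranking receives a larger loss than one that agrees with it.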
Related papers
- Rzenembed: Towards Comprehensive Multimodal Retrieval (2025) 0.00
- Multilingual-to-multimodal (M2M): Unlocking New Languages With Monolingual Text (2026) 0.00
- Metaembed: Scaling Multimodal Retrieval At Test-time With Flexible Late Interaction (2025) 2.35
- MULE: Multimodal Universal Language Embedding (2019) 9.03
- M3DR: Towards Universal Multilingual Multimodal Document Retrieval (2025) 0.00
- Magic-mm-embedding: Towards Visual-token-efficient Universal Multimodal Embedding With Mllms (2026) 0.00
- M3P: Learning Universal Representations Via Multitask Multilingual Multimodal Pre-training (2020) 12.93
- U-MARVEL: Unveiling Key Factors For Universal Multimodal Retrieval Via Embedding Learning With Mllms (2025) 3.11