Training LLMs To Be Better Text Embedders Through Bidirectional Reconstruction
2025 · Chang Su, Dengliang Shi, Siyuan Huang, et al.
Abstract
Large language models (LLMs) have increasingly been explored as powerful text embedders. Existing LLM-based text embedding approaches often leverage the embedding of the final token, typically a reserved special token such as [EOS]. However, these tokens have not been intentionally trained to capture the semantics of the whole context, limiting their capacity as text embeddings, especially for retrieval and re-ranking tasks. We propose to add a new training stage before contrastive learning to enrich the semantics of the final token embedding. This stage employs bidirectional generative reconstruction tasks, namely EBQ2D (Embedding-Based Query-to-Document) and EBD2Q (Embedding-Based Document-to-Query), which interleave to anchor the [EOS] embedding and reconstruct either side of Query-Document pairs. Experimental results demonstrate that our additional training stage significantly improves LLM performance on the Massive Text Embedding Benchmark (MTEB), achieving new state-of-the-art re
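The final-token pooling that the abstract attributes to existing LLM-based embedders can be illustrated with a small sketch. This is not the paper's training stage (EBQ2D and EBD2Q are not implemented here); it only shows how an embedding is typically read off the last non-padding position, such as an [EOS] token, and then compared by cosine similarity. The function names, the right-padding assumption, and the toy arrays are all hypothetical and chosen for illustration.

```python
import numpy as np


def last_token_pool(hidden_states, attention_mask):
    """Return the hidden state at the last non-padding position of each sequence.

    hidden_states: array of shape (batch, seq_len, dim)
    attention_mask: array of shape (batch, seq_len), 1 = real token, 0 = padding.
    Assumes right padding, so the last real token (e.g. [EOS]) sits at
    index sum(mask) - 1.
    """
    hidden_states = np.asarray(hidden_states, dtype=np.float64)
    attention_mask = np.asarray(attention_mask).astype(int)
    last_idx = attention_mask.sum(axis=1) - 1
    batch_idx = np.arange(hidden_states.shape[0])
    return hidden_states[batch_idx, last_idx]


def cosine_scores(query_emb, doc_embs):
    """Cosine similarity between one query embedding and each document embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    return d @ q


# Toy stand-in for a model's final-layer output: 2 sequences, 3 positions, 2 dims.
toy_hidden = np.arange(12).reshape(2, 3, 2)
toy_mask = [[1, 1, 0],   # first sequence has one padding position
            [1, 1, 1]]   # second sequence is full length
toy_embeddings = last_token_pool(toy_hidden, toy_mask)
```

In a real pipeline the hidden states would come from the LLM's last layer, and the pooled vectors would be the inputs to contrastive learning or retrieval scoring.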
Related papers
- LLM-augmented Retrieval: Enhancing Retrieval Models Through Language Models And Doc-level Embedding (2024) 0.00
- Transforming LLMs Into Cross-modal And Cross-lingual Retrieval Systems (2024) 4.52
- NV-Embed: Improved Techniques For Training LLMs As Generalist Embedding Models (2024) 0.00
- ViLL-E: Video LLM Embeddings For Retrieval (2026) 0.00
- LexSemBridge: Fine-grained Dense Representation Enhancement Through Token-aware Embedding Augmentation (2025) 2.35
- MATE: Meet At The Embedding -- Connecting Images With Long Texts (2024) 5.24
- Dewey Long Context Embedding Model: A Technical Report (2025) 0.00
- LMAR: Language Model Augmented Retriever For Domain-specific Knowledge Indexing (2025) 1.57