Dewey Long Context Embedding Model: A Technical Report
2025 · Dun Zhang, Panxiang Zou, Yudong Zhou
Abstract
This technical report presents the training methodology and evaluation results of the open-source dewey_en_beta embedding model. The growing demand for retrieval-augmented generation (RAG) systems and the expanding context windows of large language models (LLMs) pose critical challenges for conventional embedding models. Current approaches often struggle to maintain semantic coherence when processing documents that exceed typical sequence length limits, which significantly degrades retrieval performance in knowledge-intensive applications. dewey_en_beta is a novel text embedding model that achieves excellent performance on the MTEB (Eng, v2) and LongEmbed benchmarks while supporting 128K-token sequences. Our technical contribution centers on chunk alignment training, a methodology that enables the simultaneous generation of localized chunk embeddings and global document-level representations through distillation. Information regarding the m