Towards A Generalist Code Embedding Model Based On Massive Data Synthesis
2025 · Chaofan Li, Jianlyu Chen, Yingxia Shao, et al.
Abstract
Code embedding models attract increasing attention due to the widespread popularity of retrieval-augmented generation (RAG) in software development. These models are expected to capture the rich semantic relationships inherent to code, which differ significantly from those found in text. However, existing models remain severely limited due to the scarcity of high-quality training data. In this work, we introduce CodeR (Code Retrieval), a state-of-the-art embedding model for general-purpose code retrieval. The superior performance of CodeR is built upon CodeR-Pile, a large-scale synthetic dataset constructed under the DRU (Diversity, Reliability, Usability) principle via a novel data synthesis pipeline. To optimize training effectiveness, we propose Annealing, a curriculum learning strategy that enables effective knowledge transfer across heterogeneous sources of data. We evaluate CodeR based on 16 diverse code retrieval tasks, where it significant
Related papers
- CoRNStack: High-quality Contrastive Data For Better Code Retrieval And Reranking (2024)
- On The Challenges And Opportunities Of Learned Sparse Retrieval For Code (2026)
- REFINE On Scarce Data: Retrieval Enhancement Through Fine-tuning Via Model Fusion Of Embedding Models (2024)
- Learning Deep Semantic Model For Code Search Using CodeSearchNet Corpus (2022)
- Advancing Retrieval-augmented Generation For Structured Enterprise And Internal Data (2025)
- CODER: An Efficient Framework For Improving Retrieval Through Contextual Document Embedding Reranking (2021)
- HetaRAG: Hybrid Deep Retrieval-augmented Generation Across Heterogeneous Data Stores (2025)
- Practical Code RAG At Scale: Task-aware Retrieval Design Choices Under Compute Budgets (2025)