Llama Nemoretriever Colembed: Top-performing Text-image Retrieval Model
2025 · Mengyao Xu, Gabriel Moreira, Ronay Ak, et al.
Abstract
Motivated by the growing demand for retrieval systems that operate across modalities, we introduce llama-nemoretriever-colembed, a unified text-image retrieval model that delivers state-of-the-art performance across multiple benchmarks. We release two model variants, 1B and 3B. The 3B model scores an NDCG@5 of 91.0 on ViDoRe V1 and 63.5 on ViDoRe V2, placing first on both leaderboards as of June 27, 2025. Our approach builds on the NVIDIA Eagle2 vision-language model (VLM), replaces its causal attention with bidirectional attention, and integrates a ColBERT-style late interaction mechanism to enable fine-grained multimodal retrieval in a shared embedding space. While this mechanism delivers superior retrieval accuracy, it introduces trade-offs in storage and efficiency, which we analyze in detail. Additionally, we adopt a two-stage training strategy to enhance the model's retrieval capabilities.
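To make the ColBERT-style late interaction mentioned in the abstract concrete, here is a minimal sketch of MaxSim scoring in plain Python. It is an illustration only, not the authors' implementation: the function names, the toy 2-dimensional embeddings, and the L2 normalization step are all assumptions made for this example.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def maxsim_score(query_tokens, doc_tokens):
    """Late-interaction score: for each query token embedding, take the
    maximum dot product over all document token embeddings, then sum."""
    # Normalizing here is an assumption; it makes each dot product a cosine.
    q = [l2_normalize(v) for v in query_tokens]
    d = [l2_normalize(v) for v in doc_tokens]
    return sum(
        max(sum(a * b for a, b in zip(qv, dv)) for dv in d)
        for qv in q
    )

# Toy embeddings, made up for illustration: one vector per token or patch.
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
page_b = [[1.0, 1.0]]

# Rank candidate pages by their late-interaction score, best first.
ranked = sorted(
    {"page_a": page_a, "page_b": page_b}.items(),
    key=lambda item: maxsim_score(query, item[1]),
    reverse=True,
)
```

Because every document keeps one vector per token or image patch rather than a single pooled vector, index size grows with document length, which is the kind of storage trade-off the abstract refers to.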
Related papers
- Nemotron Colembed V2: Top-performing Late Interaction Embedding Models For Visual Document Retrieval (2026) · 0.00
- Tevatron 2.0: Unified Document Retrieval Toolkit Across Scale, Language, And Modality (2025) · 3.58
- Evo-retriever: LLM-guided Curriculum Evolution With Viewpoint-pathway Collaboration For Multimodal Document Retrieval (2026) · 0.00
- ColPali: Efficient Document Retrieval With Vision Language Models (2024) · 0.00
- Llama-embed-nemotron-8b: A Universal Text Embedding Model For Multilingual And Cross-lingual Tasks (2025) · 0.00
- Towards Retrieval-augmented Architectures For Image Captioning (2024) · 9.41
- Retrieval-augmented Multimodal Language Modeling (2022) · 0.00
- NLLB-CLIP -- Train Performant Multilingual Image Retrieval Model On A Budget (2023) · 0.00