Nemotron ColEmbed V2: Top-performing Late Interaction Embedding Models For Visual Document Retrieval
2026 · Gabriel de Souza P. Moreira, Ronay Ak, Mengyao Xu, et al.
Abstract
Retrieval-Augmented Generation (RAG) systems have been popular for generative applications, powering language models by injecting external knowledge. Companies have been trying to leverage their large catalog of documents (e.g. PDFs, presentation slides) in such RAG pipelines, whose first step is the retrieval component. Dense retrieval has been a popular approach, where embedding models are used to generate a dense representation of the user query that is closer to relevant content embeddings. More recently, VLM-based embedding models have become popular for visual document retrieval, as they preserve visual information and simplify the indexing pipeline compared to OCR text extraction. Motivated by the growing demand for visual document retrieval, we introduce Nemotron ColEmbed V2, a family of models that achieve state-of-the-art performance on the ViDoRe benchmarks. We release three variants - with 3B, 4B, and 8B parameters - based on pre-trained VLMs: NVIDIA Eagle 2 with Llama 3.
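The abstract contrasts single-vector dense retrieval with the late interaction approach named in the title but does not spell out the scoring. The sketch below illustrates both, assuming the MaxSim scoring commonly used by ColBERT- and ColPali-style models; whether Nemotron ColEmbed V2 scores exactly this way is an assumption. All function names and the toy vectors are hypothetical and are not part of any released API.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def normalize(v: Vector) -> List[float]:
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return [0.0 for _ in v]
    return [x / norm for x in v]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def dense_score(query_vec: Vector, doc_vec: Vector) -> float:
    """Single-vector dense retrieval: cosine similarity of two embeddings."""
    return dot(normalize(query_vec), normalize(doc_vec))


def late_interaction_score(
    query_tokens: Sequence[Vector], doc_tokens: Sequence[Vector]
) -> float:
    """MaxSim (assumed here): for each query token embedding, take its best
    cosine match among the document's token/patch embeddings, then sum."""
    doc_unit = [normalize(d) for d in doc_tokens]
    if not doc_unit:
        return 0.0
    total = 0.0
    for q in query_tokens:
        q_unit = normalize(q)
        total += max(dot(q_unit, d) for d in doc_unit)
    return total


def rank_pages(
    query_tokens: Sequence[Vector], pages: Dict[str, Sequence[Vector]]
) -> List[str]:
    """Return page ids sorted by descending late interaction score."""
    scores = {
        pid: late_interaction_score(query_tokens, toks)
        for pid, toks in pages.items()
    }
    return sorted(scores, key=lambda pid: scores[pid], reverse=True)


# Toy 2-dimensional embeddings, made up purely for illustration.
toy_query = [[1.0, 0.0], [0.0, 1.0]]
toy_pages = {
    "page_a": [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    "page_b": [[1.0, 0.0], [-1.0, 0.0]],
}
ranking = rank_pages(toy_query, toy_pages)
```

In this toy setup, `page_a` contains a close match for both query tokens while `page_b` matches only one, so `page_a` ranks first. In a real pipeline the multi-vector embeddings would come from the VLM encoder, with one vector per query token and per image patch.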
Related papers
- Llama Nemoretriever Colembed: Top-performing Text-image Retrieval Model (2025)
- ColPali: Efficient Document Retrieval With Vision Language Models (2024)
- ModernVBERT: Towards Smaller Visual Document Retrievers (2025)
- VLM2Vec-V2: Advancing Multimodal Embedding For Videos, Images, And Visual Documents (2025)
- Enhancing Document VQA Models Via Retrieval-augmented Generation (2025)
- SERVAL: Surprisingly Effective Zero-shot Visual Document Retrieval Powered By Large Vision And Language Models (2025)
- LLM-augmented Retrieval: Enhancing Retrieval Models Through Language Models And Doc-level Embedding (2024)
- Rethinking Hybrid Retrieval: When Small Embeddings And LLM Re-ranking Beat Bigger Models (2025)