Towards Mixed-modal Retrieval For Universal Retrieval-augmented Generation
2025 Β· Chenghao Zhang, Guanting Dong, Xinyu Yang, et al.
Abstract
Retrieval-Augmented Generation (RAG) has emerged as a powerful paradigm for enhancing large language models (LLMs) by retrieving relevant documents from an external corpus. However, existing RAG systems primarily focus on unimodal text documents, and often fall short in real-world scenarios where both queries and documents may contain mixed modalities (such as text and images). In this paper, we address the challenge of Universal Retrieval-Augmented Generation (URAG), which involves retrieving and reasoning over mixed-modal information to improve vision-language generation. To this end, we propose Nyx, a unified mixed-modal to mixed-modal retriever tailored for URAG scenarios. To mitigate the scarcity of realistic mixed-modal data, we introduce a four-stage automated pipeline for generation and filtering, leveraging web documents to construct NyxQA, a dataset comprising diverse mixed-modal question-answer pairs that better reflect real-world information needs. Building on this high-qua
Authors
(none)
Tags
Stats
Related papers
- Universalrag: Retrieval-augmented Generation Over Corpora Of Diverse Modalities And Granularities (2025)0.00
- Murag: Multimodal Retrieval-augmented Generator For Open Question Answering Over Images And Text (2022)14.66
- Cross-modal RAG: Sub-dimensional Text-to-image Retrieval-augmented Generation (2025)0.00
- Multimodal RAG For Unstructured Data:leveraging Modality-aware Knowledge Graphs With Hybrid Retrieval (2025)0.00
- Re-ranking The Context For Multimodal Retrieval Augmented Generation (2025)0.00
- Are We On The Right Way For Assessing Document Retrieval-augmented Generation? (2025)0.00
- OMGM: Orchestrate Multiple Granularities And Modalities For Efficient Multimodal Retrieval (2025)0.00
- Rag-check: Evaluating Multimodal Retrieval Augmented Generation Performance (2025)0.00