MLLM Is A Strong Reranker: Advancing Multimodal Retrieval-augmented Generation Via Knowledge-enhanced Reranking And Noise-injected Training
2024 · Zhanpeng Chen, Chengjin Xu, Yiyan Qi, et al.
Abstract
Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in processing and generating content across multiple data modalities. However, a significant drawback of MLLMs is their reliance on static training data, leading to outdated information and limited contextual awareness. This static nature hampers their ability to provide accurate and up-to-date responses, particularly in dynamic or rapidly evolving contexts. Though integrating Multimodal Retrieval-augmented Generation (Multimodal RAG) offers a promising solution, the system would inevitably encounter the multi-granularity noisy correspondence (MNC) problem, which hinders accurate retrieval and generation. In this work, we propose RagVL, a novel framework with knowledge-enhanced reranking and noise-injected training, to address these limitations. We instruction-tune the MLLM with a simple yet effective instruction template to induce its ranking ability and serve it as a reranker to precisely filter the to
Related papers
- RETLLM: Training And Data-free MLLMs For Multimodal Information Retrieval (2026) · 1.57
- LamRA: Large Multimodal Model As Your Advanced Retrieval Assistant (2024) · 7.50
- Re-ranking The Context For Multimodal Retrieval Augmented Generation (2025) · 0.00
- Domain-aware RAG: Mol-enhanced RL For Efficient Training And Scalable Retrieval (2025) · 0.00
- Indexing Multimodal Language Models For Large-scale Image Retrieval (2026) · 0.00
- MM-Embed: Universal Multimodal Retrieval With Multimodal LLMs (2024) · 0.00
- LMAR: Language Model Augmented Retriever For Domain-specific Knowledge Indexing (2025) · 1.57
- Generative Cross-modal Retrieval: Memorizing Images In Multimodal Language Models For Retrieval And Beyond (2024) · 8.35