MG\(^2\)-RAG: Multi-granularity Graph For Multimodal Retrieval-augmented Generation

Abstract

Retrieval-Augmented Generation (RAG) mitigates hallucinations in Multimodal Large Language Models (MLLMs), yet existing systems struggle with complex cross-modal reasoning. Flat vector retrieval often ignores structural dependencies, while current graph-based methods rely on costly "translation-to-text" pipelines that discard fine-grained visual information. To address these limitations, we propose MG\(^2\)-RAG, a lightweight Multi-Granularity Graph RAG framework that jointly improves graph construction, modality fusion, and cross-modal retrieval. MG\(^2\)-RAG constructs a hierarchical multimodal knowledge graph by combining lightweight textual parsing with entity-driven visual grounding, enabling textual entities and visual regions to be fused into unified multimodal nodes that preserve atomic evidence. Building on this representation, we introduce a multi-granularity graph retrieval mechanism that aggregates dense similarities
