M4-RAG: A Massive-scale Multilingual Multi-cultural Multimodal RAG
2025 Β· David Anugraha, Patrick Amadeus Irawan, Anshul Singh, et al.
Abstract
Vision-language models (VLMs) have achieved strong performance in visual question answering (VQA), yet they remain constrained by static training data. Retrieval-Augmented Generation (RAG) mitigates this limitation by enabling access to up-to-date, culturally grounded, and multilingual information; however, multilingual multimodal RAG remains largely underexplored. We introduce M4-RAG, a massive-scale benchmark spanning 42 languages, 56 regional dialects and registers, and 189 countries, comprising over 80,000 culturally diverse image-question pairs for evaluating retrieval-augmented VQA across languages and modalities. To balance realism with reproducibility, we build a controlled retrieval environment containing millions of carefully curated multilingual documents relevant to the query domains, approximating real-world retrieval conditions while ensuring consistent experimentation. Our systematic evaluation reveals that although RAG consistently benefits smaller VLMs, it fails to sca
Authors
(none)
Tags
Stats
Related papers
- Visual-rag: Benchmarking Text-to-image Retrieval Augmented Generation For Visual Knowledge Intensive Queries (2025)0.00
- RAVEN: Multitask Retrieval Augmented Vision-language Learning (2024)0.00
- SV-RAG: Lora-contextualizing Adaptation Of Mllms For Long Document Understanding (2024)0.00
- Are We On The Right Way For Assessing Document Retrieval-augmented Generation? (2025)0.00
- Murag: Multimodal Retrieval-augmented Generator For Open Question Answering Over Images And Text (2022)14.66
- MG\(^2\)-RAG: Multi-granularity Graph For Multimodal Retrieval-augmented Generation (2026)0.00
- Rag-check: Evaluating Multimodal Retrieval Augmented Generation Performance (2025)0.00
- OMGM: Orchestrate Multiple Granularities And Modalities For Efficient Multimodal Retrieval (2025)0.00