RAVENEA: A Benchmark For Multimodal Retrieval-augmented Visual Culture Understanding
2025 Β· Jiaang Li, Yifei Yuan, Wenyan Li, et al.
Abstract
As vision-language models (VLMs) become increasingly integrated into daily life, the need for accurate visual culture understanding is becoming critical. Yet, these models frequently fall short in interpreting cultural nuances effectively. Prior work has demonstrated the effectiveness of retrieval-augmented generation (RAG) in enhancing cultural understanding in text-only settings, while its application in multimodal scenarios remains underexplored. To bridge this gap, we introduce RAVENEA (Retrieval-Augmented Visual culturE uNdErstAnding), a new benchmark designed to advance visual culture understanding through retrieval, focusing on two tasks: culture-focused visual question answering (cVQA) and culture-informed image captioning (cIC). RAVENEA extends existing datasets by integrating over 11,396 unique Wikipedia documents curated and ranked by human annotators. Through the extensive evaluation on seven multimodal retrievers and fifteen VLMs, RAVENEA reveals some undiscovered findings
Authors
(none)
Tags
Stats
Related papers
- RAVEN: Multitask Retrieval Augmented Vision-language Learning (2024)0.00
- Mrag-bench: Vision-centric Evaluation For Retrieval-augmented Multimodal Models (2024)0.00
- M4-RAG: A Massive-scale Multilingual Multi-cultural Multimodal RAG (2025)2.00
- Visual-rag: Benchmarking Text-to-image Retrieval Augmented Generation For Visual Knowledge Intensive Queries (2025)0.00
- Visr-bench: An Empirical Study On Visual Retrieval-augmented Generation For Multilingual Long Document Understanding (2025)0.00
- Vision-deepresearch Benchmark: Rethinking Visual And Textual Search For Multimodal Large Language Models (2026)7.27
- RAVID: Retrieval-augmented Visual Detection: A Knowledge-driven Approach For Ai-generated Image Identification (2025)0.00
- RAVEN: In-context Learning With Retrieval-augmented Encoder-decoder Language Models (2023)0.00