RobustVisRAG: Causality-aware Vision-based Retrieval-augmented Generation Under Visual Degradations
2026 · I-Hsiang Chen, Yu-Wei Liu, Tse-Yu Wu, et al.
Abstract
Vision-based Retrieval-Augmented Generation (VisRAG) leverages vision-language models (VLMs) to jointly retrieve relevant visual documents and generate grounded answers based on multimodal evidence. However, existing VisRAG models degrade in performance when visual inputs suffer from distortions such as blur, noise, low light, or shadow, where semantic and degradation factors become entangled within pretrained visual encoders, leading to errors in both retrieval and generation stages. To address this limitation, we introduce RobustVisRAG, a causality-guided dual-path framework that improves VisRAG robustness while preserving efficiency and zero-shot generalization. RobustVisRAG uses a non-causal path to capture degradation signals through unidirectional attention and a causal path to learn purified semantics guided by these signals. Together with the proposed Non-Causal Distortion Modeling and Causal Semantic Alignment objectives, the framework enforces a clear separation between seman
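The abstract does not say how the unidirectional attention in the non-causal path is wired. As a generic illustration only, the sketch below builds a one-way attention mask between two token groups. In this sketch, hypothetical auxiliary tokens can read the image-patch tokens, but patch tokens cannot read the auxiliary tokens, so the patch outputs do not depend on them. The token layout, the direction of the mask, the names `one_way_mask` and `masked_attention`, and the identity query/key/value projections are all assumptions. This is not RobustVisRAG's implementation.

```python
import numpy as np


def one_way_mask(n_patch: int, n_aux: int) -> np.ndarray:
    """Boolean mask M where M[i, j] is True if query token i may attend to key token j.

    Hypothetical layout: the first n_patch rows/cols are image-patch tokens,
    the last n_aux are auxiliary tokens (an assumption for illustration).
    """
    n = n_patch + n_aux
    mask = np.ones((n, n), dtype=bool)
    # Patch tokens may not read auxiliary tokens; auxiliary tokens may read everything.
    mask[:n_patch, n_patch:] = False
    return mask


def masked_attention(x: np.ndarray, mask: np.ndarray):
    """Single-head scaled dot-product attention with identity Q/K/V projections (assumed)."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    # Disallowed pairs get -inf so their softmax weight is exactly zero.
    scores = np.where(mask, scores, -np.inf)
    # Subtract the row max for numerical stability (every row has at least one allowed key).
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((6, 8))  # 4 patch tokens + 2 auxiliary tokens, dim 8
    mask = one_way_mask(n_patch=4, n_aux=2)
    out, _ = masked_attention(x, mask)

    # Changing only the auxiliary tokens leaves the patch-token outputs unchanged.
    x2 = x.copy()
    x2[4:] = rng.standard_normal((2, 8))
    out2, _ = masked_attention(x2, mask)
    print(np.allclose(out[:4], out2[:4]))  # True
```

The one-way mask makes the information flow explicit. Auxiliary tokens can aggregate evidence from the patches, while the patch representations stay isolated from them. Swapping the mask direction would give the opposite flow.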
Related papers
- VisRAG 2.0: Evidence-guided Multi-image Reasoning In Visual Retrieval-augmented Generation (2025) · 0.00
- VDocRAG: Retrieval-augmented Generation Over Visually-rich Documents (2025) · 6.34
- VISOR: Agentic Visual Retrieval-augmented Generation Via Iterative Search And Over-horizon Reasoning (2026) · 0.00
- RegionRAG: Region-level Retrieval-augmented Generation For Visual Document Understanding (2025) · 0.00
- RAVID: Retrieval-augmented Visual Detection: A Knowledge-driven Approach For AI-generated Image Identification (2025) · 0.00
- Cross-modal RAG: Sub-dimensional Text-to-image Retrieval-augmented Generation (2025) · 0.00
- Visual-RAG: Benchmarking Text-to-image Retrieval Augmented Generation For Visual Knowledge Intensive Queries (2025) · 0.00
- Enhancing Document VQA Models Via Retrieval-augmented Generation (2025) · 0.00