SV-RAG: LoRA-Contextualizing Adaptation Of MLLMs For Long Document Understanding
2024 · Jian Chen, Ruiyi Zhang, Yufan Zhou, et al.
Abstract
Multimodal large language models (MLLMs) have recently shown great progress in text-rich image understanding, yet they still struggle with complex, multi-page, visually-rich documents. Traditional methods that use document parsers for retrieval-augmented generation suffer from performance and efficiency limitations, while directly presenting all pages to MLLMs is inefficient, especially for lengthy documents. In this work, we present a novel framework named **S**elf-**V**isual **R**etrieval-**A**ugmented **G**eneration (SV-RAG), which can broaden the horizons of any MLLM to support long-document understanding. We demonstrate that **MLLMs themselves can be an effective multimodal retriever** to fetch relevant pages and then answer user questions based on these pages. SV-RAG is implemented with two specific MLLM adapters, one for evidence page retrieval and the other for question answering. Empirical results show state-of-the-art performance on public benchmarks, demonstrating the effective
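The retrieve-then-answer flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `retrieve_top_k` and `sv_rag_answer`, the `embed_question`, `embed_page`, and `answer_fn` callables (stand-ins for the retrieval and question-answering adapters), and the parameter `k` are all hypothetical. The abstract does not state how pages are scored, so cosine similarity over single vectors is used here purely as an assumed placeholder.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(query_vec: Vector, page_vecs: List[Vector], k: int) -> List[int]:
    """Return indices of the k pages most similar to the query, best first."""
    ranked = sorted(
        range(len(page_vecs)),
        key=lambda i: cosine(query_vec, page_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


def sv_rag_answer(
    question: str,
    pages: List[str],
    embed_question: Callable[[str], Vector],   # hypothetical retrieval adapter (query side)
    embed_page: Callable[[str], Vector],       # hypothetical retrieval adapter (page side)
    answer_fn: Callable[[str, List[str]], str],  # hypothetical QA adapter
    k: int = 2,
) -> Tuple[str, List[int]]:
    """Stage 1: pick evidence pages. Stage 2: answer using only those pages."""
    query_vec = embed_question(question)
    page_vecs = [embed_page(page) for page in pages]
    top = retrieve_top_k(query_vec, page_vecs, k)
    # Pass evidence in original document order (an assumption of this sketch).
    evidence = [pages[i] for i in sorted(top)]
    return answer_fn(question, evidence), top
```

In this sketch the pages are plain strings for simplicity; in the setting the abstract describes they would be page images, and both stages would be served by the same underlying MLLM with a different adapter active in each stage.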
Related papers
- M4-RAG: A Massive-scale Multilingual Multi-cultural Multimodal RAG (2025) 2.00
- Are We On The Right Way For Assessing Document Retrieval-augmented Generation? (2025) 0.00
- RegionRAG: Region-level Retrieval-augmented Generation For Visual Document Understanding (2025) 0.00
- MLLM Is A Strong Reranker: Advancing Multimodal Retrieval-augmented Generation Via Knowledge-enhanced Reranking And Noise-injected Training (2024) 9.18
- Enhancing Document VQA Models Via Retrieval-augmented Generation (2025) 0.00
- Multi-head RAG: Solving Multi-aspect Problems With LLMs (2024) 0.00
- VDocRAG: Retrieval-augmented Generation Over Visually-rich Documents (2025) 6.34
- Re-ranking The Context For Multimodal Retrieval Augmented Generation (2025) 0.00