Fine-grained Late-interaction Multi-modal Retrieval For Retrieval Augmented Visual Question Answering
2023 · Weizhe Lin, Jinghong Chen, Jingbiao Mei, et al.
Abstract
Knowledge-based Visual Question Answering (KB-VQA) requires VQA systems to utilize knowledge from external knowledge bases to answer visually-grounded questions. Retrieval-Augmented Visual Question Answering (RA-VQA), a strong framework to tackle KB-VQA, first retrieves related documents with Dense Passage Retrieval (DPR) and then uses them to answer questions. This paper proposes Fine-grained Late-interaction Multi-modal Retrieval (FLMR) which significantly improves knowledge retrieval in RA-VQA. FLMR addresses two major limitations in RA-VQA's retriever: (1) the image representations obtained via image-to-text transforms can be incomplete and inaccurate and (2) relevance scores between queries and documents are computed with one-dimensional embeddings, which can be insensitive to finer-grained relevance. FLMR overcomes these limitations by obtaining image representations that complement those from the image-to-text transforms using a vision model aligned with an existing text-based r
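Limitation (2) above can be illustrated with a small sketch that contrasts a single-vector relevance score with a token-level late-interaction score. The abstract is cut off before it describes FLMR's scoring function, so this assumes a ColBERT-style MaxSim formulation (for each query token, take its best-matching document token, then sum). The function names, the mean-pooling step, and the toy vectors are all hypothetical and are not taken from the paper.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def mean_pool(tokens: List[Vector]) -> Vector:
    """Collapse token embeddings into one vector (assumed stand-in for a
    single-embedding encoder; real DPR encoders pool differently)."""
    n = len(tokens)
    return [sum(column) / n for column in zip(*tokens)]


def single_vector_score(query_vec: Vector, doc_vec: Vector) -> float:
    """Relevance as one inner product between one query and one document vector."""
    return dot(query_vec, doc_vec)


def late_interaction_score(query_tokens: List[Vector], doc_tokens: List[Vector]) -> float:
    """Assumed MaxSim scoring: sum over query tokens of the best token-level match."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Toy data (hypothetical): two query tokens and two candidate documents.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0]]   # matches each query token exactly
doc_b = [[0.5, 0.5], [0.5, 0.5]]   # only matches the query "on average"

pooled_a = single_vector_score(mean_pool(query), mean_pool(doc_a))
pooled_b = single_vector_score(mean_pool(query), mean_pool(doc_b))
late_a = late_interaction_score(query, doc_a)
late_b = late_interaction_score(query, doc_b)

print(pooled_a, pooled_b)  # 0.5 0.5 (the pooled scores cannot separate the documents)
print(late_a, late_b)      # 2.0 1.0 (token-level matching prefers doc_a)
```

In this toy case both documents pool to the same vector, so the single-vector score ties, while the token-level score still separates them. That is the sense in which a single embedding per query and document can be insensitive to finer-grained relevance.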
Related papers
- A Symmetric Dual Encoding Dense Retrieval Framework For Knowledge-intensive Visual Question Answering (2023) · 9.92
- OMGM: Orchestrate Multiple Granularities And Modalities For Efficient Multimodal Retrieval (2025) · 0.00
- Pre-training Multi-modal Dense Retrievers For Outside-knowledge Visual Question Answering (2023) · 7.50
- Cross-modal Retrieval For Knowledge-based Visual Question Answering (2024) · 7.81
- Object Retrieval For Visual Question Answering With Outside Knowledge (2024) · 0.00
- Developing Visual Augmented Q&A System Using Scalable Vision Embedding Retrieval & Late Interaction Re-ranker (2025) · 0.00
- Cross-modal Retrieval Augmentation For Multi-modal Classification (2021) · 9.23
- End-to-end Knowledge Retrieval With Multi-modal Queries (2023) · 8.35