A Symmetric Dual Encoding Dense Retrieval Framework For Knowledge-intensive Visual Question Answering
2023 · Alireza Salemi, Juan Altmayer Pizzorno, Hamed Zamani
Abstract
Knowledge-Intensive Visual Question Answering (KI-VQA) refers to answering a question about an image whose answer does not lie in the image. This paper presents a new pipeline for KI-VQA tasks, consisting of a retriever and a reader. First, we introduce DEDR, a symmetric dual encoding dense retrieval framework in which documents and queries are encoded into a shared embedding space using uni-modal (textual) and multi-modal encoders. We introduce an iterative knowledge distillation approach that bridges the gap between the representation spaces in these two encoders. Extensive evaluation on two well-established KI-VQA datasets, i.e., OK-VQA and FVQA, suggests that DEDR outperforms state-of-the-art baselines by 11.6% and 30.9% on OK-VQA and FVQA, respectively. Utilizing the passages retrieved by DEDR, we further introduce MM-FiD, an encoder-decoder multi-modal fusion-in-decoder model, for generating a textual answer for KI-VQA tasks. MM-FiD encodes the question, the image, and each retri
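The retrieval step described in the abstract, where queries and documents are placed in a shared embedding space, can be illustrated with a minimal sketch. It assumes the embeddings have already been produced and that relevance is scored by inner product. The vectors, the function names `dot` and `rank_passages`, and the scoring choice are hypothetical stand-ins for illustration, not DEDR's actual encoders or training procedure.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def rank_passages(
    query_vec: Sequence[float],
    passage_vecs: List[Sequence[float]],
    k: int = 2,
) -> List[int]:
    """Return indices of the k passages with the highest inner-product score."""
    scored = [(dot(query_vec, p), i) for i, p in enumerate(passage_vecs)]
    # Sort by descending score; break ties by the lower passage index.
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [i for _, i in scored[:k]]


# Toy embeddings (assumed): in a real system these would come from the
# query-side and document-side encoders, which share one embedding space.
query = [1.0, 0.0, 1.0]
passages = [
    [0.0, 1.0, 0.0],  # score 0.0
    [1.0, 0.0, 0.5],  # score 1.5
    [0.2, 0.5, 1.0],  # score 1.2
]

top = rank_passages(query, passages, k=2)
print(top)
```

Because both sides live in the same space, passage vectors can be computed once and indexed offline, leaving only the query to be encoded at search time.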
Related papers
- Pre-training Multi-modal Dense Retrievers For Outside-knowledge Visual Question Answering (2023) 7.50
- Fine-grained Late-interaction Multi-modal Retrieval For Retrieval Augmented Visual Question Answering (2023) 5.24
- Cross-modal Retrieval For Knowledge-based Visual Question Answering (2024) 7.81
- From Known To The Unknown: Transferring Knowledge To Answer Questions About Novel Visual And Semantic Concepts (2018) 8.82
- End-to-end Knowledge Retrieval With Multi-modal Queries (2023) 8.35
- Index Light, Reason Deep: Deferred Visual Ingestion For Visual-dense Document Question Answering (2026) 0.00
- REVEAL: Retrieval-augmented Visual-language Pre-training With Multi-source Multimodal Knowledge Memory (2022) 13.65
- Universal Vision-language Dense Retrieval: Learning A Unified Representation Space For Multi-modal Retrieval (2022) 3.45