Pre-training Multi-modal Dense Retrievers For Outside-knowledge Visual Question Answering
2023 · Alireza Salemi, Mahta Rafiee, Hamed Zamani
Abstract
This paper studies a category of visual question answering tasks in which external knowledge is necessary to answer the questions. This category is called outside-knowledge visual question answering (OK-VQA). A major step in developing OK-VQA systems is retrieving relevant documents for a given multi-modal query. The current state-of-the-art asymmetric dense retrieval model for this task uses an architecture with a multi-modal query encoder and a uni-modal document encoder. Such an architecture requires a large amount of training data to perform effectively. We propose an automatic data generation pipeline for pre-training passage retrieval models for OK-VQA tasks. The proposed approach leads to a 26.9% improvement in Precision@5 compared to the current state-of-the-art asymmetric architecture. Additionally, the proposed pre-training approach performs well in zero-shot retrieval scenarios.
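For readers unfamiliar with the metric reported above, the following is a minimal sketch of how Precision@k is commonly computed for a single query: the fraction of the top k retrieved passages that are relevant. The function name `precision_at_k` and the passage identifiers are hypothetical and chosen only for illustration; the abstract does not describe the evaluation code, so this should not be read as the authors' implementation.

```python
def precision_at_k(ranked_ids, relevant_ids, k=5):
    """Return the fraction of the top-k ranked passages that are relevant.

    ranked_ids: passage ids ordered by retrieval score, best first.
    relevant_ids: ids judged relevant for the query.
    Divides by k (not by the number retrieved), which is one common
    convention; this is an assumption, not taken from the paper.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    relevant = set(relevant_ids)
    hits = sum(1 for pid in ranked_ids[:k] if pid in relevant)
    return hits / k


# Hypothetical ranking for one query: two of the top five passages are relevant.
ranking = ["p1", "p2", "p3", "p4", "p5", "p6"]
judged_relevant = {"p2", "p5", "p9"}
score = precision_at_k(ranking, judged_relevant, k=5)
```

A corpus-level figure is typically the mean of this per-query value over all evaluation queries.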
Related papers
- A Symmetric Dual Encoding Dense Retrieval Framework For Knowledge-intensive Visual Question Answering (2023) 9.92
- Fine-grained Late-interaction Multi-modal Retrieval For Retrieval Augmented Visual Question Answering (2023) 5.24
- Cross-modal Retrieval For Knowledge-based Visual Question Answering (2024) 7.81
- Object Retrieval For Visual Question Answering With Outside Knowledge (2024) 0.00
- REVEAL: Retrieval-augmented Visual-language Pre-training With Multi-source Multimodal Knowledge Memory (2022) 13.65
- Cross-modal Retrieval Augmentation For Multi-modal Classification (2021) 9.23
- End-to-end Knowledge Retrieval With Multi-modal Queries (2023) 8.35
- Pixel-grounded Retrieval For Knowledgeable Large Multimodal Models (2026) 0.00