Cross-modal Retrieval Augmentation For Multi-modal Classification
2021 · Shir Gur, Natalia Neverova, Chris Stauffer, et al.
Abstract
Recent advances in using retrieval components over external knowledge sources have shown impressive results for a variety of downstream tasks in natural language processing. Here, we explore the use of unstructured external knowledge sources of images and their corresponding captions for improving visual question answering (VQA). First, we train a novel alignment model for embedding images and captions in the same space, which achieves a substantial improvement in image-caption retrieval performance relative to similar methods. Second, we show that retrieval-augmented multi-modal transformers using the trained alignment model improve results on VQA over strong baselines. We further conduct extensive experiments to establish the promise of this approach, and examine novel applications at inference time, such as hot-swapping indices.
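The retrieval step and the index hot-swapping mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class names (CaptionIndex, RetrievalAugmenter), the brute-force cosine search, and the toy 3-dimensional vectors are all assumptions made for illustration, and the hand-written vectors stand in for the output of a trained alignment model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


class CaptionIndex:
    """Hypothetical in-memory index of caption embeddings (brute-force search)."""

    def __init__(self, embeddings, captions):
        self.embeddings = [list(e) for e in embeddings]
        self.captions = list(captions)

    def search(self, query, k=2):
        """Return the k captions whose embeddings are closest to the query."""
        scored = [
            (cosine(query, emb), cap)
            for emb, cap in zip(self.embeddings, self.captions)
        ]
        scored.sort(key=lambda pair: -pair[0])
        return [cap for _, cap in scored[:k]]


class RetrievalAugmenter:
    """Retrieves captions for an image embedding; the index can be replaced."""

    def __init__(self, index):
        self.index = index

    def swap_index(self, new_index):
        # Hot-swap: change the knowledge source without touching the model.
        self.index = new_index

    def augment(self, image_embedding, k=2):
        return self.index.search(image_embedding, k=k)


# Toy embeddings; a real system would obtain these from the alignment model.
index_a = CaptionIndex(
    [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]],
    ["a dog on grass", "a puppy running", "a red car"],
)
index_b = CaptionIndex(
    [[1.0, 0.1, 0.0], [0.0, 0.0, 1.0]],
    ["a brown dog outdoors", "a bowl of soup"],
)

augmenter = RetrievalAugmenter(index_a)
image_embedding = [1.0, 0.05, 0.0]
retrieved_a = augmenter.augment(image_embedding, k=2)

augmenter.swap_index(index_b)
retrieved_b = augmenter.augment(image_embedding, k=1)

print(retrieved_a)
print(retrieved_b)
```

In a retrieval-augmented setup of this kind, the retrieved captions would be passed to the multi-modal transformer alongside the image and question. Because the index is a separate object, it can be replaced at inference time without retraining, which is the idea behind hot-swapping.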
Related papers
- Cross-modal Retrieval For Knowledge-based Visual Question Answering (2024) 7.81
- Towards Retrieval-augmented Architectures For Image Captioning (2024) 9.41
- Fine-grained Late-interaction Multi-modal Retrieval For Retrieval Augmented Visual Question Answering (2023) 5.24
- End-to-end Knowledge Retrieval With Multi-modal Queries (2023) 8.35
- Leveraging Visual Question Answering For Image-caption Ranking (2016) 12.10
- X-TRA: Improving Chest X-ray Tasks With Cross-modal Retrieval Augmentation (2023) 8.09
- Pre-training Multi-modal Dense Retrievers For Outside-knowledge Visual Question Answering (2023) 7.50
- Pixel-grounded Retrieval For Knowledgeable Large Multimodal Models (2026) 0.00