Leveraging Visual Question Answering For Image-caption Ranking
2016 · Xiao Lin, Devi Parikh
Abstract
Visual Question Answering (VQA) is the task of taking as input an image and a free-form natural language question about the image, and producing an accurate answer. In this work we view VQA as a "feature extraction" module to extract image and caption representations. We employ these representations for the task of image-caption ranking. Each feature dimension captures (imagines) whether a fact (question-answer pair) could plausibly be true for the image and caption. This allows the model to interpret images and captions from a wide variety of perspectives. We propose score-level and representation-level fusion models to incorporate VQA knowledge in an existing state-of-the-art VQA-agnostic image-caption ranking model. We find that incorporating and reasoning about consistency between images and captions significantly improves performance. Concretely, our model improves state-of-the-art on caption retrieval by 7.1% and on image retrieval by 4.4% on the MSCOCO dataset.
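The score-level fusion idea in the abstract can be illustrated with a minimal sketch. It assumes each image and each caption has already been mapped to a vector whose entries estimate how plausible a fixed set of question-answer pairs is. The function names, the cosine-similarity consistency term, and the fixed `weight` are all hypothetical choices made for illustration; the abstract does not specify how the paper's fusion models are parameterized or trained.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm else 0.0

def fused_score(base_score, image_vqa, caption_vqa, weight=0.5):
    """Hypothetical score-level fusion.

    base_score: score from a VQA-agnostic image-caption ranking model.
    image_vqa, caption_vqa: plausibility of each question-answer pair
    for the image and for the caption (assumed precomputed).
    weight: assumed fixed mixing coefficient, not a value from the paper.
    """
    return base_score + weight * cosine(image_vqa, caption_vqa)

def rank_captions(base_scores, image_vqa, caption_vqas, weight=0.5):
    """Return caption indices sorted from best to worst fused score."""
    scores = [
        fused_score(s, image_vqa, c, weight)
        for s, c in zip(base_scores, caption_vqas)
    ]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# Toy data: caption 1 agrees with the image on every question-answer pair,
# while caption 0 has a slightly higher base score but disagrees.
image_vqa = [0.9, 0.1, 0.8]
caption_vqas = [[0.1, 0.9, 0.2], [0.9, 0.1, 0.8]]
base_scores = [0.55, 0.50]

print(rank_captions(base_scores, image_vqa, caption_vqas, weight=0.0))
print(rank_captions(base_scores, image_vqa, caption_vqas, weight=0.5))
```

With `weight=0.0` the ranking follows the base scores alone; with a positive weight, the caption whose question-answer plausibilities match the image's moves ahead. A representation-level variant would instead combine the VQA vectors with the base model's features before scoring.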
Related papers
- VQA4CIR: Boosting Composed Image Retrieval With Visual Question Answering (2023)
- Cross-modal Retrieval Augmentation For Multi-modal Classification (2021)
- From Known To The Unknown: Transferring Knowledge To Answer Questions About Novel Visual And Semantic Concepts (2018)
- Detect, Describe, Discriminate: Moving Beyond VQA For MLLM Evaluation (2024)
- Fine-grained Late-interaction Multi-modal Retrieval For Retrieval Augmented Visual Question Answering (2023)
- Cross-modal Retrieval For Knowledge-based Visual Question Answering (2024)
- Retrieval-augmented Image Captioning (2023)
- Enhancing Document VQA Models Via Retrieval-augmented Generation (2025)