Multimodal Hypothetical Summary For Retrieval-based Multi-image Question Answering
2024 · Peize Li, Qingyi Si, Peng Fu, et al.
Abstract
The retrieval-based multi-image question answering (QA) task involves retrieving multiple question-related images and synthesizing them to generate an answer. Conventional "retrieve-then-answer" pipelines often suffer from cascading errors because the QA training objective does not optimize the retrieval stage. To address this issue, we propose a novel method that effectively introduces and references retrieved information in QA. Given the image set to be retrieved, we employ a multimodal large language model (visual perspective) and a large language model (textual perspective) to obtain multimodal hypothetical summaries (MHyS) in question form and description form. By combining visual and textual perspectives, MHyS captures image content more specifically and replaces real images in retrieval, which eliminates the modality gap by transforming the task into text-to-text retrieval and helps improve retrieval. To more advantageously introduce retrieval with QA, we employ contrastive learning to a
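The retrieval idea in the abstract, matching a question against text summaries that stand in for the images, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the summaries are hand-written (in the paper they would come from a multimodal LLM and an LLM), the bag-of-words cosine scorer is a stand-in for whatever text retriever the authors use, and the names `tokenize`, `cosine`, and `retrieve` are hypothetical. Scoring an image by its best-matching summary is likewise a choice made for this sketch, not something the abstract states.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and split into alphanumeric tokens.
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    # Cosine similarity between two token-count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, summaries, top_k=2):
    # summaries maps an image id to its text summaries
    # (question-form and description-form). Images are ranked
    # by their best-matching summary (an assumption of this sketch).
    q_vec = Counter(tokenize(question))
    scores = {
        image_id: max(cosine(q_vec, Counter(tokenize(t))) for t in texts)
        for image_id, texts in summaries.items()
    }
    ranked = sorted(scores, key=lambda i: scores[i], reverse=True)
    return ranked[:top_k]


# Hand-written stand-ins for model-generated summaries.
summaries = {
    "img_bridge": [
        "What color is the suspension bridge?",
        "A red suspension bridge over a bay.",
    ],
    "img_tower": [
        "How tall is the clock tower?",
        "A stone clock tower in a city square.",
    ],
    "img_cat": [
        "What is the cat sitting on?",
        "A grey cat sitting on a sofa.",
    ],
}

top = retrieve(
    "Which is taller, the clock tower or the suspension bridge?",
    summaries,
    top_k=2,
)
print(top)
```

Because both the query and the candidates are text, no cross-modal encoder is needed at retrieval time; the images themselves are only consulted again when the answer is generated.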
Related papers
- An Interactive Multi-modal Query Answering System With Retrieval-augmented Large Language Models (2024) 5.84
- Fine-grained Late-interaction Multi-modal Retrieval For Retrieval Augmented Visual Question Answering (2023) 5.24
- Cross-modal Retrieval For Knowledge-based Visual Question Answering (2024) 7.81
- MuRAG: Multimodal Retrieval-augmented Generator For Open Question Answering Over Images And Text (2022) 14.66
- Cross-modal Retrieval Augmentation For Multi-modal Classification (2021) 9.23
- Pixel-grounded Retrieval For Knowledgeable Large Multimodal Models (2026) 0.00
- VQA4CIR: Boosting Composed Image Retrieval With Visual Question Answering (2023) 5.24
- Detect, Describe, Discriminate: Moving Beyond VQA For MLLM Evaluation (2024) 0.00