VQA4CIR: Boosting Composed Image Retrieval With Visual Question Answering
2023 · Chun-Mei Feng, Yang Bai, Tao Luo, et al.
Abstract
Although progress has been made in Composed Image Retrieval (CIR), we empirically find that a certain percentage of failed retrieval results are not consistent with their relative captions. To address this issue, this work provides a Visual Question Answering (VQA) perspective to boost the performance of CIR. The resulting VQA4CIR is a post-processing approach and can be directly plugged into existing CIR methods. Given the top-C images retrieved by a CIR method, VQA4CIR aims to decrease the adverse effect of failed retrieval results that are inconsistent with the relative caption. To find the retrieved images inconsistent with the relative caption, we resort to the "QA generation to VQA" self-verification pipeline. For QA generation, we suggest fine-tuning an LLM (e.g., LLaMA) to generate several pairs of questions and answers from each relative caption. We then fine-tune an LVLM (e.g., LLaVA) to obtain the VQA model. By feeding the retrieved image and question to the VQA model, one can fi
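The post-processing idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`is_consistent`, `rerank_top_c`), the exact-match answer comparison, and the choice to move inconsistent images to the end of the top-C block are all assumptions made for illustration, and the fine-tuned LLM and LVLM are replaced by a toy lookup table.

```python
from typing import Callable, List, Sequence, Tuple

QAPair = Tuple[str, str]            # (question, expected answer)
VQAModel = Callable[[str, str], str]  # (image id, question) -> answer


def is_consistent(image: str, qa_pairs: Sequence[QAPair], vqa: VQAModel) -> bool:
    """Return True if the VQA answer matches the expected answer for every question.

    Assumption: answers are compared by case-insensitive exact match.
    """
    return all(
        vqa(image, question).strip().lower() == answer.strip().lower()
        for question, answer in qa_pairs
    )


def rerank_top_c(
    ranked: Sequence[str], qa_pairs: Sequence[QAPair], vqa: VQAModel, c: int
) -> List[str]:
    """Re-rank only the top-C images; images below rank C keep their order.

    Assumption: inconsistent images are demoted to the end of the top-C
    block, preserving the original relative order within each group.
    """
    head, tail = list(ranked[:c]), list(ranked[c:])
    flags = [is_consistent(img, qa_pairs, vqa) for img in head]
    consistent = [img for img, ok in zip(head, flags) if ok]
    inconsistent = [img for img, ok in zip(head, flags) if not ok]
    return consistent + inconsistent + tail


# Toy stand-ins (hypothetical): a lookup table plays the role of the VQA model,
# and the QA pair plays the role of the LLM output for one relative caption.
toy_answers = {
    ("img_a", "Is the dress red?"): "no",
    ("img_b", "Is the dress red?"): "yes",
    ("img_c", "Is the dress red?"): "Yes",
}


def toy_vqa(image: str, question: str) -> str:
    return toy_answers.get((image, question), "unknown")


ranked = ["img_a", "img_b", "img_c", "img_d"]
qa_pairs = [("Is the dress red?", "yes")]
reranked = rerank_top_c(ranked, qa_pairs, toy_vqa, c=3)
print(reranked)  # ['img_b', 'img_c', 'img_a', 'img_d']
```

In this sketch the underlying CIR method is untouched; only the order of its top-C output changes, which is what makes the approach pluggable as a post-processing step.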
Related papers
- Leveraging Visual Question Answering For Image-caption Ranking (2016) [12.10]
- Detect, Describe, Discriminate: Moving Beyond VQA For MLLM Evaluation (2024) [0.00]
- Fine-grained Late-interaction Multi-modal Retrieval For Retrieval Augmented Visual Question Answering (2023) [5.24]
- Instance-level Composed Image Retrieval (2025) [0.00]
- Sentence-level Prompts Benefit Composed Image Retrieval (2023) [3.95]
- Good4cir: Generating Detailed Synthetic Captions For Composed Image Retrieval (2025) [0.00]
- Leveraging Large Vision-language Model As User Intent-aware Encoder For Composed Image Retrieval (2024) [3.58]
- Enhancing Document VQA Models Via Retrieval-augmented Generation (2025) [0.00]