Revising Image-text Retrieval Via Multi-modal Entailment
2022 · Xu Yan, Chunhui Ai, Ziqiang Cao, et al.
Abstract
An outstanding image-text retrieval model depends on high-quality labeled data. While the builders of existing image-text retrieval datasets strive to ensure that the caption matches the linked image, they cannot prevent a caption from fitting other images. We observe that such a many-to-many matching phenomenon is quite common in the widely-used retrieval datasets, where one caption can describe up to 178 images. These large matching-lost data not only confuse the model in training but also weaken the evaluation accuracy. Inspired by visual and textual entailment tasks, we propose a multi-modal entailment classifier to determine whether a sentence is entailed by an image plus its linked captions. Subsequently, we revise the image-text retrieval datasets by adding these entailed captions as additional weak labels of an image and develop a universal variable learning rate strategy to teach a retrieval model to distinguish the entailed captions from other negative samples. In experiments
Related papers
- Using Text To Teach Image Retrieval (2020) · 5.24
- MLLMs-augmented Visual-language Representation Learning (2023) · 0.00
- Towards Retrieval-augmented Architectures For Image Captioning (2024) · 9.41
- Multi-modal Reference Learning For Fine-grained Text-to-image Retrieval (2025) · 6.77
- Event-retriever: Event-aware Multimodal Image Retrieval For Realistic Captions (2025) · 0.00
- Retrieval-augmented Image Captioning (2023) · 11.29
- Webly Supervised Joint Embedding For Cross-modal Image-text Retrieval (2018) · 13.17
- Improving Image Recognition By Retrieving From Web-scale Image-text Data (2023) · 9.41