Adapting Dual-encoder Vision-language Models For Paraphrased Retrieval
2024 · Jiacheng Cheng, Hijung Valentina Shin, Nuno Vasconcelos, et al.
Abstract
In recent years, dual-encoder vision-language models (e.g., CLIP) have achieved remarkable text-to-image retrieval performance. However, we find that these models usually return very different results for a pair of paraphrased queries. Such behavior can make a retrieval system less predictable and lead to user frustration. In this work, we consider the task of paraphrased text-to-image retrieval, in which a model aims to return similar results for a pair of paraphrased queries. To start with, we collect a dataset of paraphrased image descriptions to facilitate quantitative evaluation on this task. We then hypothesize that the undesired behavior of existing dual-encoder models is due to their text towers, which are trained on image-sentence pairs and lack the ability to capture the semantic similarity between paraphrased queries. To improve on this, we investigate multiple strategies for training a dual-encoder model starting from a language model pretrained on a lar
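The consistency goal described in the abstract (similar results for a pair of paraphrased queries) can be made concrete with a small sketch. The code below is an illustration only, not the paper's evaluation protocol: it assumes query and image embeddings are already available as NumPy arrays, uses random vectors as stand-ins for real encoder outputs, and measures agreement with a top-k Jaccard overlap. The function names `top_k` and `top_k_jaccard` are hypothetical.

```python
import numpy as np


def top_k(query_emb, image_embs, k):
    """Return indices of the k images most cosine-similar to the query."""
    q = query_emb / np.linalg.norm(query_emb)
    imgs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    scores = imgs @ q  # cosine similarity to every image
    return np.argsort(-scores)[:k]


def top_k_jaccard(query_a, query_b, image_embs, k=10):
    """Jaccard overlap between the top-k result sets of two queries.

    1.0 means both queries retrieve the same k images; 0.0 means the
    result sets are disjoint. This is an assumed, illustrative metric.
    """
    a = set(top_k(query_a, image_embs, k).tolist())
    b = set(top_k(query_b, image_embs, k).tolist())
    return len(a & b) / len(a | b)


# Placeholder data: random vectors stand in for text/image tower outputs.
rng = np.random.default_rng(0)
image_embs = rng.normal(size=(1000, 64))
query = rng.normal(size=64)
unrelated = rng.normal(size=64)

identical_overlap = top_k_jaccard(query, query, image_embs)
unrelated_overlap = top_k_jaccard(query, unrelated, image_embs)
print(identical_overlap, unrelated_overlap)
```

In this framing, a text tower that maps two paraphrases to nearby embeddings yields an overlap close to 1, while a tower that separates them yields a low overlap even when both queries describe the same image.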
Related papers
- Fine-tuning CLIP Text Encoders With Two-step Paraphrasing (2024) 2.26
- Contrastive Vision-language Learning With Paraphrasing And Negation (2025) 0.00
- PriorCLIP: Visual Prior Guided Vision-language Model For Remote Sensing Image-text Retrieval (2024) 0.00
- Understanding Retrieval-augmented Task Adaptation For Vision-language Models (2024) 0.00
- Optimizing CLIP Models For Image Retrieval With Maintained Joint-embedding Alignment (2024) 6.34
- DualCap: Enhancing Lightweight Image Captioning Via Dual Retrieval With Similar Scenes Visual Prompts (2025) 0.00
- A Comprehensive Empirical Study Of Vision-language Pre-trained Model For Supervised Cross-modal Retrieval (2022) 0.00
- VLDeformer: Vision-language Decomposed Transformer For Fast Cross-modal Retrieval (2021) 10.21