Progressive Learning For Image Retrieval With Hybrid-modality Queries
2022 · Yida Zhao, Yuqing Song, Qin Jin
Abstract
Image retrieval with hybrid-modality queries, also known as composing text and image for image retrieval (CTI-IR), is a retrieval task where the search intention is expressed in a more complex query format, involving both vision and text modalities. For example, a target product image is searched using a reference product image along with text about changing certain attributes of the reference image as the query. It is a more challenging image retrieval task that requires both semantic space learning and cross-modal fusion. Previous approaches that attempt to deal with both aspects achieve unsatisfactory performance. In this paper, we decompose the CTI-IR task into a three-stage learning problem to progressively learn the complex knowledge for image retrieval with hybrid-modality queries. We first leverage the semantic embedding space for open-domain image-text retrieval, and then transfer the learned knowledge to the fashion-domain with fashion-related pre-training tasks. Finally, we
Related papers
- Training And Challenging Models For Text-guided Fashion Image Retrieval (2022) 0.00
- Compositional Learning Of Image-text Query For Image Retrieval (2020) 17.87
- Fashionmv: Product-level Composed Image Retrieval With Multi-view Fashion Data (2026) 2.98
- TMCIR: Token Merge Benefits Composed Image Retrieval (2025) 0.00
- Composing Text And Image For Image Retrieval - An Empirical Odyssey (2018) 18.71
- Cala: Complementary Association Learning For Augmenting Composed Image Retrieval (2024) 9.41
- Bi-directional Training For Composed Image Retrieval Via Text Prompt Learning (2023) 15.63
- BOSS: Bottom-up Cross-modal Semantic Composition With Hybrid Counterfactual Training For Robust Content-based Image Retrieval (2022) 0.00