Zero-shot Composed Text-image Retrieval
2023 · Yikun Liu, Jiangchao Yao, Ya Zhang, et al.
Abstract
In this paper, we consider the problem of composed image retrieval (CIR), which aims to train a model that can fuse multi-modal information, e.g., text and images, to accurately retrieve images that match the query, thereby extending the user's expressive ability. We make the following contributions: (i) we initiate a scalable pipeline to automatically construct datasets for training CIR models, by simply exploiting a large-scale dataset of image-text pairs, e.g., a subset of LAION-5B; (ii) we introduce a transformer-based adaptive aggregation model, TransAgg, which employs a simple yet efficient fusion mechanism to adaptively combine information from diverse modalities; (iii) we conduct extensive ablation studies to investigate the usefulness of our proposed data construction procedure and the effectiveness of core components in TransAgg; (iv) when evaluating on the publicly available benchmarks under the zero-shot scenario, i.e., training on the automatically constructed datasets, then directl
Related papers
- From Mapping To Composing: A Two-stage Framework For Zero-shot Composed Image Retrieval (2025) 0.00
- Context-cir: Learning From Concepts In Text For Composed Image Retrieval (2025) 4.67
- Scaling Prompt Instructed Zero Shot Composed Image Retrieval With Image-only Data (2025) 0.00
- Zero-shot Composed Image Retrieval Considering Query-target Relationship Leveraging Masked Image-text Pairs (2024) 5.24
- Training-free Zero-shot Composed Image Retrieval Via Weighted Modality Fusion And Similarity (2024) 5.84
- Image2sentence Based Asymmetrical Zero-shot Composed Image Retrieval (2024) 0.00
- Zero Shot Composed Image Retrieval (2025) 1.57
- SCOT: Self-supervised Contrastive Pretraining For Zero-shot Compositional Retrieval (2025) 0.00