Self-enhancement Improves Text-image Retrieval In Foundation Visual-language Models
2023 · Yuguang Yang, Yiming Wang, Shupeng Geng, et al.
Abstract
The emergence of cross-modal foundation models has introduced numerous approaches grounded in text-image retrieval. However, on some domain-specific retrieval tasks, these models fail to focus on the key attributes required. To address this issue, we propose a self-enhancement framework, A^{3}R, built on CLIP-ViT/G-14, one of the largest cross-modal models. First, before model learning, we apply an Attribute Augmentation strategy that enriches the textual descriptions for fine-grained representation. Then, after model learning, we propose an Adaption Re-ranking method that unifies the representation space of the textual query and the candidate images, and re-ranks the candidate images using the adapted query. The framework achieves a salient improvement over the baseline and other teams' solutions in the cross-modal image retrieval track of the 1st foundation model challenge, without introducing any additional samples. The code is available at https://github.com/CapricornGua
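The two stages named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how attributes are added or how the query is adapted, so the sketch substitutes two simple, commonly used stand-ins. `augment_caption` appends attribute phrases that a caption lacks, and `rerank` performs a query-expansion style re-ranking in which the text embedding is blended with the centroid of its top-ranked image embeddings. The function names and the `top_k` and `alpha` parameters are hypothetical, and the embeddings are assumed to come from a shared text-image encoder such as CLIP.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def augment_caption(caption, attributes):
    """Hypothetical attribute augmentation: append attributes the caption lacks."""
    missing = [a for a in attributes if a.lower() not in caption.lower()]
    if not missing:
        return caption
    return f"{caption.rstrip('.')}, {', '.join(missing)}."


def rerank(query_emb, image_embs, top_k=2, alpha=0.5):
    """Assumed stand-in for adaptive re-ranking (query expansion).

    1. Rank images by cosine similarity to the text query.
    2. Move the query toward the centroid of the top_k images.
    3. Re-rank all images against the adapted query.
    """
    if not image_embs:
        return [], []
    q = normalize(query_emb)
    imgs = [normalize(v) for v in image_embs]

    first_pass = [dot(v, q) for v in imgs]
    top = sorted(range(len(imgs)), key=lambda i: -first_pass[i])[:top_k]

    dim = len(q)
    centroid = [sum(imgs[i][d] for i in top) / len(top) for d in range(dim)]
    adapted = normalize(
        [(1 - alpha) * q[d] + alpha * centroid[d] for d in range(dim)]
    )

    scores = [dot(v, adapted) for v in imgs]
    order = sorted(range(len(imgs)), key=lambda i: -scores[i])
    return order, scores
```

Setting `alpha` to 0 recovers the plain first-pass ranking, so the parameter controls how far the query is pulled toward the image representation space.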
Related papers
- Enhancing Recipe Retrieval With Foundation Models: A Data Augmentation Perspective (2023) 6.77
- DREAM: Improving Video-text Retrieval Through Relevance-based Augmentation Using Large Foundation Models (2024) 2.26
- ELIP: Enhanced Visual-language Foundation Models For Image Retrieval (2025) 2.26
- Benchmark Granularity And Model Robustness For Image-text Retrieval (2024) 0.00
- Enhancing Image-text Matching With Adaptive Feature Aggregation (2024) 6.34
- The Solution For The CVPR 2023 1st Foundation Model Challenge-track2 (2024) 0.00
- Cross-modal Attribute Insertions For Assessing The Robustness Of Vision-and-language Learning (2023) 2.00
- Robust Cross-modal Representation Learning With Progressive Self-distillation (2022) 12.33