Towards Fast And Accurate Image-text Retrieval With Self-supervised Fine-grained Alignment
2023 · Jiamin Zhuang, Jing Yu, Yang Ding, et al.
Abstract
Image-text retrieval requires the system to bridge the heterogeneous gap between vision and language for accurate retrieval while keeping the network lightweight enough for efficient retrieval. Existing trade-off solutions mainly either incorporate cross-modal interactions into the independent-embedding framework or leverage stronger pretrained encoders, both of which still demand time-consuming similarity measurement or a heavyweight model structure in the retrieval stage. In this work, we propose an image-text alignment module, SelfAlign, on top of the independent-embedding framework, which improves retrieval accuracy while maintaining retrieval efficiency without extra supervision. SelfAlign contains two collaborative sub-modules that force image-text alignment at both the concept level and the context level through self-supervised contrastive learning. It does not require cross-modal embedding interactions during training while maintaining independent image and text encoders during
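To make the independent-embedding idea in the abstract concrete, the following is a minimal sketch of a generic symmetric contrastive (InfoNCE-style) objective over separately encoded images and texts, followed by retrieval with a plain similarity lookup. It is not the authors' SelfAlign module: the function names, the toy embeddings, and the temperature of 0.07 are all assumptions chosen for illustration, and the concept-level and context-level sub-modules are not modeled.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def similarity_matrix(image_embs, text_embs):
    """Cosine similarity between every image embedding and every text embedding."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(i, t)) for t in txts] for i in imgs]

def info_nce(sim, temperature=0.07):
    """Symmetric InfoNCE loss; matched pairs are assumed to sit on the diagonal.

    The temperature value is an illustrative assumption, not taken from the paper.
    """
    n = len(sim)

    def one_direction(rows):
        total = 0.0
        for k, row in enumerate(rows):
            logits = [s / temperature for s in row]
            m = max(logits)  # subtract the max for numerical stability
            log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
            total += log_denom - logits[k]  # negative log-softmax of the match
        return total / n

    cols = [list(c) for c in zip(*sim)]  # text-to-image direction
    return 0.5 * (one_direction(sim) + one_direction(cols))

def retrieve(query_emb, gallery_embs, top_k=1):
    """Rank gallery items by cosine similarity; no cross-modal interaction needed."""
    sims = similarity_matrix([query_emb], gallery_embs)[0]
    return sorted(range(len(sims)), key=lambda j: sims[j], reverse=True)[:top_k]

# Toy embeddings (hypothetical): two images and their two captions.
image_embs = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
shuffled_texts = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = info_nce(similarity_matrix(image_embs, aligned_texts))
loss_shuffled = info_nce(similarity_matrix(image_embs, shuffled_texts))
best_match = retrieve([1.0, 0.0], [[0.0, 1.0], [1.0, 0.1]])
```

Because each modality is embedded on its own, gallery embeddings can be computed once offline, and a query then costs only one encoder pass plus dot products. This is the efficiency property the abstract attributes to the independent-embedding framework.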
Related papers
- ALADIN: Distilling Fine-grained Alignment Scores For Efficient Image-text Matching And Retrieval (2022) 14.00
- A New Fine-grained Alignment Method For Image-text Matching (2023) 0.00
- Learning Relation Alignment For Calibrated Cross-modal Retrieval (2021) 8.82
- Cross-modal And Uni-modal Soft-label Alignment For Image-text Retrieval (2024) 15.75
- Optimizing CLIP Models For Image Retrieval With Maintained Joint-embedding Alignment (2024) 6.34
- Modest-align: Data-efficient Alignment For Vision-language Models (2025) 0.00
- Retrieve Fast, Rerank Smart: Cooperative And Joint Approaches For Improved Cross-modal Retrieval (2021) 10.97
- Thinking Fast And Slow: Efficient Text-to-visual Retrieval With Transformers (2021) 15.16