Towards Cross-modal Backward-compatible Representation Learning For Vision-language Models
2024 Β· Young Kyun Jang, Ser-Nam Lim
Abstract
Modern retrieval systems often struggle with upgrading to new and more powerful models due to the incompatibility of embeddings between the old and new models. This necessitates a costly process known as backfilling, which involves re-computing the embeddings for a large number of data samples. In vision, Backward-compatible Training (BT) has been proposed to ensure that the new model aligns with the old model's embeddings. This paper extends the concept of vision-only BT to the field of cross-modal retrieval, marking the first attempt to address Cross-modal BT (XBT). Our goal is to achieve backward-compatibility between Vision-Language Pretraining (VLP) models, such as CLIP, for the cross-modal retrieval task. To address XBT challenges, we propose an efficient solution: a projection module that maps the new model's embeddings to those of the old model. This module, pretrained solely with text data, significantly reduces the number of image-text pairs required for XBT learning, and, on
Authors
(none)
Tags
Stats
Related papers
- CL2CM: Improving Cross-lingual Cross-modal Retrieval Via Cross-lingual Knowledge Transfer (2023)8.60
- Towards Backward-compatible Representation Learning (2020)12.93
- COTS: Collaborative Two-stream Vision-language Pre-training Model For Cross-modal Retrieval (2022)13.60
- Forward Compatible Training For Large-scale Embedding Retrieval Systems (2021)8.09
- Uclip: Parameter-efficient Multilingual Extension Of Vision-language Models With Unpaired Data (2025)0.00
- A Comprehensive Empirical Study Of Vision-language Pre-trained Model For Supervised Cross-modal Retrieval (2022)0.00
- Clip-vip: Adapting Pre-trained Image-text Model To Video-language Representation Alignment (2022)5.42
- Boundary-aware Backward-compatible Representation Via Adversarial Learning In Image Retrieval (2023)9.21