Equivariant Similarity For Vision-language Foundation Models
2023 · Tan Wang, Kevin Lin, Linjie Li, et al.
Abstract
This study explores the concept of equivariance in vision-language foundation models (VLMs), focusing on the multimodal similarity function, which is both the major training objective and the core output that supports downstream tasks. Unlike the existing image-text similarity objective, which only categorizes matched pairs as similar and unmatched pairs as dissimilar, equivariance also requires similarity to vary faithfully with semantic changes. This allows VLMs to generalize better to nuanced and unseen multimodal compositions. However, modeling equivariance is challenging because ground truth for semantic change is difficult to collect. For example, given an image-text pair about a dog, it is unclear how much the similarity should change when the pixels are changed from a dog to a cat. To this end, we propose EqSim, a regularization loss that can be efficiently calculated from any two matched training pairs and is easily pluggable into existing image-text retriev
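The abstract is cut off before the loss is defined, so the following is only an illustrative sketch of what an equivariance-style regularizer over two matched pairs could look like. It is not the paper's definition of EqSim. The function names `cosine` and `equivariance_penalty`, the use of cosine similarity, and the squared-difference form are all assumptions made for this example. The idea shown is that swapping the caption of pair 1 for the caption of pair 2 should lower the similarity by about as much as swapping the image of pair 2 for the image of pair 1.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def equivariance_penalty(img1, txt1, img2, txt2):
    """Hypothetical regularizer for two matched pairs (img1, txt1), (img2, txt2).

    Assumption for illustration only: the similarity drop caused by the same
    semantic change should be consistent whichever matched pair it is
    measured from.
    """
    s11 = cosine(img1, txt1)  # matched pair 1
    s22 = cosine(img2, txt2)  # matched pair 2
    s12 = cosine(img1, txt2)  # image 1 with caption 2
    s21 = cosine(img2, txt1)  # image 2 with caption 1

    drop_from_pair1 = s11 - s12
    drop_from_pair2 = s22 - s21
    # Penalize any disagreement between the two drops.
    return (drop_from_pair1 - drop_from_pair2) ** 2


# Toy 2-D embeddings: a symmetric case and an asymmetric case.
symmetric = equivariance_penalty([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0])
asymmetric = equivariance_penalty([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0])
print(symmetric, asymmetric)
```

In a training loop, a term like this would be added to the usual matching objective with a small weight. Note that it needs no annotation of how large the semantic change is, only two matched pairs, which is the property the abstract highlights.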
Related papers
- Unified Loss Of Pair Similarity Optimization For Vision-language Retrieval (2022)
- Evo-retriever: LLM-guided Curriculum Evolution With Viewpoint-pathway Collaboration For Multimodal Document Retrieval (2026)
- CLAY: Conditional Visual Similarity Modulation In Vision-language Embedding Space (2026)
- Vision-language Modelling For Radiological Imaging And Reports In The Low Data Regime (2023)
- A Multimodal Recaptioning Framework To Account For Perceptual Diversity Across Languages In Vision-language Modeling (2025)
- VLM2Vec: Training Vision-language Models For Massive Multimodal Embedding Tasks (2024)
- Fill The Gap: Quantifying And Reducing The Modality Gap In Image-text Representation Learning (2025)
- Multimodal Representation Alignment For Cross-modal Information Retrieval (2025)