CoVLR: Coordinating Cross-modal Consistency And Intra-modal Structure For Vision-language Retrieval
2023 · Yang Yang, Zhongtian Fu, Xiangyu Wu, et al.
Abstract
Current vision-language retrieval aims to perform cross-modal instance search, in which the core idea is to learn consistent vision-language representations. Although the performance of cross-modal retrieval has greatly improved with the development of deep models, we find that traditional hard consistency may destroy the original relationships among single-modal instances, leading to performance degradation in single-modal retrieval. To address this challenge, we experimentally observe that vision-language divergence may give rise to strong and weak modalities, and that hard cross-modal consistency cannot guarantee that the relationships among strong-modality instances are unaffected by the weak modality; as a result, those relationships are perturbed even though consistent representations are learned. To this end, we propose a novel and directly Coordinated Vision-Language Retrieval method (dubbed CoVLR), which aims to study and alleviate the
Related papers
- Cross-modal Coordination Across A Diverse Set Of Input Modalities (2024) 0.00
- Improving The Consistency In Cross-lingual Cross-modal Retrieval With 1-to-k Contrastive Learning (2024) 5.84
- Rebalanced Vision-language Retrieval Considering Structure-aware Distillation (2024) 2.26
- A Comprehensive Empirical Study Of Vision-language Pre-trained Model For Supervised Cross-modal Retrieval (2022) 0.00
- VLDeformer: Vision-language Decomposed Transformer For Fast Cross-modal Retrieval (2021) 10.21
- COTS: Collaborative Two-stream Vision-language Pre-training Model For Cross-modal Retrieval (2022) 13.60
- Deep Reversible Consistency Learning For Cross-modal Retrieval (2025) 7.81
- X-aligner: Composed Visual Retrieval Without The Bells And Whistles (2026) 0.00