Rebalanced Vision-language Retrieval Considering Structure-aware Distillation
2024 · Yang Yang, Wenjuan Xi, Luping Zhou, et al.
Abstract
Vision-language retrieval aims to search for similar instances in one modality based on queries from another modality. The primary objective is to learn cross-modal matching representations in a latent common space. Cross-modal matching implicitly assumes modal balance, where each modality contains sufficient information to represent the others. In practice, however, noise interference and modality insufficiency often lead to modal imbalance, and the impact of this imbalance on retrieval performance remains an open question. In this paper, we first demonstrate that ultimate cross-modal matching is generally sub-optimal for cross-modal retrieval when imbalanced modalities exist. Imbalanced modalities inherently affect the structure of instances in the common space, which poses a challenge to cross-modal similarity measurement. To address this issue, we emphasize the importance of meaningful structure-preserved matching. Accordingly,
Related papers
- COVLR: Coordinating Cross-modal Consistency And Intra-modal Structure For Vision-language Retrieval (2023) · 4.52
- Sparse And Dense Retrievers Learn Better Together: Joint Sparse-dense Optimization For Text-image Retrieval (2025) · 0.00
- Universal Vision-language Dense Retrieval: Learning A Unified Representation Space For Multi-modal Retrieval (2022) · 3.45
- VLDeformer: Vision-language Decomposed Transformer For Fast Cross-modal Retrieval (2021) · 10.21
- Modality-balanced Embedding For Video Retrieval (2022) · 7.16
- Vision-language Dataset Distillation (2023) · 0.00
- Cross-modal Fusion Distillation For Fine-grained Sketch-based Image Retrieval (2022) · 2.68
- TokenFlow: Rethinking Fine-grained Cross-modal Alignment In Vision-language Retrieval (2022) · 0.00