Text-video Retrieval With Global-local Semantic Consistent Learning
2024 · Haonan Zhang, Pengpeng Zeng, Lianli Gao, et al.
Abstract
Adapting large-scale image-text pre-training models, e.g., CLIP, to the video domain represents the current state-of-the-art for text-video retrieval. The primary approaches involve transferring text-video pairs to a common embedding space and leveraging cross-modal interactions on specific entities for semantic alignment. Though effective, these paradigms entail prohibitive computational costs, leading to inefficient retrieval. To address this, we propose a simple yet effective method, Global-Local Semantic Consistent Learning (GLSCL), which capitalizes on latent shared semantics across modalities for text-video retrieval. Specifically, we introduce a parameter-free global interaction module to explore coarse-grained alignment. Then, we devise a shared local interaction module that employs several learnable queries to capture latent semantic concepts for learning fine-grained alignment. Furthermore, an Inter-Consistency Loss (ICL) is devised to accomplish the concept alignment between
Related papers
- T2VLAD: Global-local Sequence Alignment For Text-video Retrieval (2021) 16.65
- Prompt Switch: Efficient CLIP Adaptation For Text-video Retrieval (2023) 11.93
- CLIP2Video: Mastering Video-text Retrieval Via Image CLIP (2021) 0.00
- X-CLIP: End-to-end Multi-grained Contrastive Learning For Video-text Retrieval (2022) 18.12
- LaT: Latent Translation With Cycle-consistency For Video-text Retrieval (2022) 0.00
- GOAL: Global-local Object Alignment Learning (2025) 2.26
- Fine-tuned CLIP Models Are Efficient Video Learners (2022) 21.57
- DGL: Dynamic Global-local Prompt Tuning For Text-video Retrieval (2024) 14.35