X-CLIP: End-to-end Multi-grained Contrastive Learning For Video-text Retrieval
2022 · Yiwei Ma, Guohai Xu, Xiaoshuai Sun, et al.
Abstract
Video-text retrieval is a crucial and fundamental task in multi-modal research. Its development has been considerably advanced by large-scale multi-modal contrastive pre-training, which primarily focuses on coarse-grained or fine-grained contrast. However, cross-grained contrast, that is, the contrast between coarse-grained representations and fine-grained representations, has rarely been explored in prior research. Compared with fine-grained or coarse-grained contrast, cross-grained contrast calculates the correlation between a coarse-grained feature and each fine-grained feature, and can filter out unnecessary fine-grained features under the guidance of the coarse-grained feature during similarity calculation, thus improving retrieval accuracy. To this end, this paper presents a novel multi-grained contrastive model, namely X-CLIP, for video-text retrieval. However, another challenge lies in the similarity aggregation problem, which aims to aggregat
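The cross-grained contrast described in the abstract can be illustrated with a minimal NumPy sketch. It assumes one coarse-grained vector (for example, a video-level feature) and a set of fine-grained vectors (for example, word-level features), scores each fine-grained feature against the coarse one by cosine similarity, and then down-weights the weakly correlated ones. The function name `cross_grained_similarity`, the `temperature` parameter, and the softmax-weighted aggregation are illustrative assumptions; the abstract text above is cut off before it describes the aggregation scheme, so this is not presented as the paper's method.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to (approximately) unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def cross_grained_similarity(coarse, fine, temperature=0.01):
    """Score one coarse-grained feature against many fine-grained features.

    coarse: array of shape (d,), e.g. a video-level embedding (assumed).
    fine:   array of shape (n, d), e.g. word-level embeddings (assumed).
    Returns (score, weights), where `weights` has shape (n,) and sums to 1.
    """
    coarse = l2_normalize(np.asarray(coarse, dtype=float))
    fine = l2_normalize(np.asarray(fine, dtype=float))

    # Cosine similarity between the coarse feature and each fine feature.
    sims = fine @ coarse  # shape (n,)

    # Assumption: softmax over the similarities, so fine-grained features
    # that correlate weakly with the coarse feature get small weights.
    logits = sims / temperature
    logits = logits - logits.max()  # subtract the max for numerical stability
    weights = np.exp(logits) / np.exp(logits).sum()

    # Weighted sum of the per-feature similarities.
    score = float(weights @ sims)
    return score, weights


if __name__ == "__main__":
    video_feature = np.array([1.0, 0.0])
    word_features = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
    score, weights = cross_grained_similarity(video_feature, word_features)
    print(score, weights)
```

A small `temperature` pushes the aggregate toward the best-matching fine-grained feature, while a large one approaches a plain mean over all of them; the value used here is arbitrary.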
Related papers
- CLIP4Clip: An Empirical Study Of CLIP For End To End Video Clip Retrieval (2021) 6.02
- TC-MGC: Text-conditioned Multi-grained Contrastive Learning For Text-video Retrieval (2025) 6.93
- CLIP2Video: Mastering Video-text Retrieval Via Image CLIP (2021) 0.00
- Improving Video-text Retrieval By Multi-stream Corpus Alignment And Dual Softmax Loss (2021) 0.00
- Normalized Contrastive Learning For Text-video Retrieval (2022) 6.77
- TeachCLIP: Multi-grained Teaching For Efficient Text-to-video Retrieval (2023) 0.00
- Towards Fast Adaptation Of Pretrained Contrastive Models For Multi-channel Video-language Retrieval (2022) 7.50
- CrossCLR: Cross-modal Contrastive Learning For Multi-modal Video Representations (2021) 15.59