CLIP2Video: Mastering Video-Text Retrieval via Image CLIP
2021 · Han Fang, Pengfei Xiong, Luhui Xu, et al.
Abstract
We present the CLIP2Video network, which transfers an image-language pre-training model to video-text retrieval in an end-to-end manner. Leading approaches in video-and-language learning try to distill spatio-temporal video features and multi-modal interaction between videos and language from a large-scale video-text dataset. In contrast, we leverage a pretrained image-language model and simplify the problem into a two-stage framework, with co-learning of image-text and enhancement of temporal relations between video frames and between video and text, respectively, which makes it possible to train on comparatively small datasets. Specifically, building on the spatial semantics captured by the Contrastive Language-Image Pretraining (CLIP) model, our model involves a Temporal Difference Block to capture motion across fine temporal video frames, and a Temporal Alignment Block to re-align the tokens of video clips and phrases and enhance the multi-modal correlation. We conduct thorough ablation studies, and achieve state-of-t
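To make the idea of adding motion cues from frame differences to per-frame image embeddings concrete, here is a minimal toy sketch in plain Python. It is not the paper's Temporal Difference Block or Temporal Alignment Block. It assumes frame and text embeddings are already available as plain lists (for example from a CLIP-style encoder), and every function name, the `motion_weight` parameter, and the additive fusion rule are hypothetical choices made for illustration only.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def frame_differences(frames):
    """Element-wise difference between each pair of adjacent frame embeddings."""
    return [[b - a for a, b in zip(prev, cur)]
            for prev, cur in zip(frames, frames[1:])]


def mean_pool(vectors):
    """Average a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def video_embedding(frames, motion_weight=0.5):
    """Hypothetical fusion: mean frame embedding plus a weighted mean of
    adjacent-frame differences. An assumption, not the paper's formulation."""
    appearance = mean_pool(frames)
    diffs = frame_differences(frames)
    motion = mean_pool(diffs) if diffs else [0.0] * len(appearance)
    fused = [a + motion_weight * m for a, m in zip(appearance, motion)]
    return l2_normalize(fused)


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return sum(a * b for a, b in zip(l2_normalize(u), l2_normalize(v)))


def rank_videos(text_emb, videos):
    """Return video names sorted by similarity to the text embedding."""
    scores = {name: cosine(text_emb, video_embedding(frames))
              for name, frames in videos.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy 2-D embeddings: one clip never changes, the other drifts toward axis 1.
videos = {
    "static": [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]],
    "moving": [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]],
}
print(rank_videos([0.0, 1.0], videos))
```

In this toy setup the query vector points along the second axis, so the clip whose frames drift in that direction ranks above the static clip. A real system would replace the mean pooling and additive fusion with learned modules.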
Related papers
- CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval (2021) 6.02
- CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment (2022) 5.42
- CLIP2TV: Align, Match and Distill for Video-Text Retrieval (2021) 0.00
- Prompt Switch: Efficient CLIP Adaptation for Text-Video Retrieval (2023) 11.93
- X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval (2022) 18.12
- Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring (2023) 17.40
- Fine-tuned CLIP Models are Efficient Video Learners (2022) 21.57
- TeachCLIP: Multi-Grained Teaching for Efficient Text-to-Video Retrieval (2023) 0.00