CLIP4Clip: An Empirical Study Of CLIP For End To End Video Clip Retrieval
2021 · Huaishao Luo, Lei Ji, Ming Zhong, et al.
Abstract
Video-text retrieval plays an essential role in multi-modal research and is widely used in real-world web applications. CLIP (Contrastive Language-Image Pre-training), an image-language pre-training model, has demonstrated the power of learning visual concepts from web-collected image-text datasets. In this paper, we propose the CLIP4Clip model to transfer the knowledge of the CLIP model to video-language retrieval in an end-to-end manner. Several questions are investigated via empirical studies: 1) Is an image feature enough for video-text retrieval? 2) How does post-pretraining on a large-scale video-text dataset based on CLIP affect performance? 3) What is a practical mechanism for modeling temporal dependency between video frames? 4) How sensitive is the model to hyper-parameters on the video-text retrieval task? Extensive experimental results show that the CLIP4Clip model transferred from CLIP can achieve SOTA results on various video-text retrieval data
Related papers
- CLIP2Video: Mastering Video-text Retrieval Via Image CLIP (2021)
- CLIP-ViP: Adapting Pre-trained Image-text Model To Video-language Representation Alignment (2022)
- Enhancing Image Retrieval : A Comprehensive Study On Photo Search Using The CLIP Mode (2024)
- X-CLIP: End-to-end Multi-grained Contrastive Learning For Video-text Retrieval (2022)
- Prompt Switch: Efficient CLIP Adaptation For Text-video Retrieval (2023)
- Jina CLIP: Your CLIP Model Is Also Your Text Retriever (2024)
- A Comprehensive Empirical Study Of Vision-language Pre-trained Model For Supervised Cross-modal Retrieval (2022)
- An Empirical Study Of Excitation And Aggregation Design Adaptions In CLIP4Clip For Video-text Retrieval (2024)