CLIP-ViP: Adapting Pre-trained Image-text Model To Video-language Representation Alignment
2022 · Hongwei Xue, Yuchong Sun, Bei Liu, et al.
Abstract
Pre-trained image-text models such as CLIP have demonstrated the strength of vision-language representations learned from large-scale web-collected image-text data. Building on these well-learned visual features, some existing works transfer image representations to the video domain and achieve good results. However, how to use an image-language pre-trained model (e.g., CLIP) for video-language pre-training (post-pretraining) is still underexplored. In this paper, we investigate two questions: 1) what factors hinder post-pretraining CLIP from further improving performance on video-language tasks? and 2) how can the impact of these factors be mitigated? Through a series of comparative experiments and analyses, we find that data scale and the domain gap between language sources have a large impact. Motivated by these findings, we propose an Omnisource Cross-modal Learning method equipped with a Video Proxy mechanism on the basis of CLIP, namely CLIP-ViP. Extensive results show that our
Related papers
- CLIP2Video: Mastering Video-text Retrieval Via Image CLIP (2021) · 0.00
- CLIP4Clip: An Empirical Study Of CLIP For End To End Video Clip Retrieval (2021) · 6.02
- Fine-tuned CLIP Models Are Efficient Video Learners (2022) · 21.57
- Advancing Myopia To Holism: Fully Contrastive Language-image Pre-training (2024) · 0.00
- LightCLIP: Learning Multi-level Interaction For Lightweight Vision-language Models (2023) · 0.00
- Cross The Gap: Exposing The Intra-modal Misalignment In CLIP Via Modality Inversion (2025) · 3.64
- Prompt Switch: Efficient CLIP Adaptation For Text-video Retrieval (2023) · 11.93
- LLM2CLIP: Powerful Language Model Unlocks Richer Cross-modality Representation (2024) · 2.26