Revisiting Temporal Modeling For CLIP-based Image-to-video Knowledge Transferring
2023 · Ruyang Liu, Jingjia Huang, Ge Li, et al.
Abstract
Image-text pretrained models, e.g., CLIP, have shown impressive general multi-modal knowledge learned from large-scale image-text data pairs, thus attracting increasing attention for their potential to improve visual representation learning in the video domain. In this paper, based on the CLIP model, we revisit temporal modeling in the context of image-to-video knowledge transferring, which is the key point for extending image-text pretrained models to the video domain. We find that current temporal modeling mechanisms are tailored to either high-level semantic-dominant tasks (e.g., retrieval) or low-level visual pattern-dominant tasks (e.g., recognition), and fail to handle both cases simultaneously. The key difficulty lies in modeling temporal dependency while taking advantage of both high-level and low-level knowledge in the CLIP model. To tackle this problem, we present the Spatial-Temporal Auxiliary Network (STAN), a simple and effective temporal modeling mechanism extending CLIP mo
Related papers
- CLIP2Video: Mastering Video-text Retrieval Via Image CLIP (2021) 0.00
- Fine-tuned CLIP Models Are Efficient Video Learners (2022) 21.57
- Prompt Switch: Efficient CLIP Adaptation For Text-video Retrieval (2023) 11.93
- CLIP-ViP: Adapting Pre-trained Image-text Model To Video-language Representation Alignment (2022) 5.42
- CLIP4Clip: An Empirical Study Of CLIP For End To End Video Clip Retrieval (2021) 6.02
- TF-CLIP: Learning Text-free CLIP For Video-based Person Re-identification (2023) 15.81
- MV-Adapter: Multimodal Video Transfer Learning For Video Text Retrieval (2023) 9.76
- TempMe: Video Temporal Token Merging For Efficient Text-video Retrieval (2024) 2.86