LiteVL: Efficient Video-language Learning With Enhanced Spatial-temporal Modeling
2022 · Dongsheng Chen, Chaofan Tao, Lu Hou, et al.
Abstract
Recent large-scale video-language pre-trained models have shown appealing performance on various downstream tasks. However, the pre-training process is computationally expensive due to the requirement of millions of video-text pairs and the redundant data structure of each video. To mitigate these problems, we propose LiteVL, which adapts a pre-trained image-language model BLIP into a video-text model directly on downstream tasks, without heavy pre-training. To enhance the temporal modeling lacking in the image-language model, we propose to add temporal attention modules in the image encoder of BLIP with dynamic temporal scaling. Besides the model-wise adaptation, we also propose a non-parametric pooling mechanism to adaptively reweight the fine-grained video embedding conditioned on the text. Experimental results on text-video retrieval and video question answering show that the proposed LiteVL even outperforms previous video-language pre-trained models by a clear margin, though witho
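The text-conditioned pooling idea mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the actual formulation, so everything here is an assumption made for illustration: the function name `text_conditioned_pool`, the use of cosine similarity between each frame embedding and the text embedding, the softmax reweighting, and the `temperature` parameter. The only point it demonstrates is that frame-level video embeddings can be reweighted by their relevance to the text without introducing any learned parameters.

```python
import math


def text_conditioned_pool(frame_embs, text_emb, temperature=1.0):
    """Pool per-frame embeddings with weights derived from the text.

    Hypothetical sketch, not the paper's method: weights are a softmax
    over cosine similarities between each frame and the text embedding.
    Assumes all embeddings are non-zero and share the same dimension.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    def norm(a):
        return math.sqrt(dot(a, a))

    text_norm = norm(text_emb)
    # Cosine similarity of each frame to the text, scaled by temperature.
    scores = [
        dot(f, text_emb) / (norm(f) * text_norm * temperature)
        for f in frame_embs
    ]
    # Numerically stable softmax over frames.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the original (unnormalized) frame embeddings.
    dim = len(frame_embs[0])
    pooled = [
        sum(w * f[d] for w, f in zip(weights, frame_embs))
        for d in range(dim)
    ]
    return pooled, weights


# Three toy "frames" in a 2-D embedding space and one text query.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text = [1.0, 0.0]
pooled, weights = text_conditioned_pool(frames, text, temperature=0.1)
```

In this toy example the first frame points in the same direction as the text, so it receives the largest weight; lowering `temperature` sharpens the weighting toward the best-matching frame, while a large value approaches uniform mean pooling.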
Related papers
- Temporal Perceiving Video-language Pre-training (2023) 0.00
- CLIP-ViP: Adapting Pre-trained Image-text Model To Video-language Representation Alignment (2022) 5.42
- MV-Adapter: Multimodal Video Transfer Learning For Video Text Retrieval (2023) 9.76
- E-ViLM: Efficient Video-language Model Via Masked Video Modeling With Semantic Vector-quantized Tokenizer (2023) 0.00
- Revitalize Region Feature For Democratizing Video-language Pre-training Of Retrieval (2022) 2.72
- CLIP2Video: Mastering Video-text Retrieval Via Image CLIP (2021) 0.00
- LightCLIP: Learning Multi-level Interaction For Lightweight Vision-language Models (2023) 0.00
- Prompt Switch: Efficient CLIP Adaptation For Text-video Retrieval (2023) 11.93