Less Is More: ClipBERT For Video-and-language Learning Via Sparse Sampling
2021 · Jie Lei, Linjie Li, Luowei Zhou, et al.
Abstract
The canonical approach to video-and-language learning (e.g., video question answering) dictates a neural model to learn from offline-extracted dense video features from vision models and text features from language models. These feature extractors are trained independently and usually on tasks different from the target domains, rendering these fixed features sub-optimal for downstream tasks. Moreover, due to the high computational overload of dense video features, it is often difficult (or infeasible) to plug feature extractors directly into existing approaches for easy finetuning. To provide a remedy to this dilemma, we propose a generic framework ClipBERT that enables affordable end-to-end learning for video-and-language tasks, by employing sparse sampling, where only a single or a few sparsely sampled short clips from a video are used at each training step. Experiments on text-to-video retrieval and video question answering on six datasets demonstrate that ClipBERT outperforms (or i
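The sparse sampling idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `sample_sparse_clips`, its parameters, and the uniform random choice of clip start positions are all assumptions made for illustration. It only shows how a training step might pick a few short clips (as frame indices) instead of using every frame of the video.

```python
import random


def sample_sparse_clips(num_frames, num_clips=2, frames_per_clip=4, seed=None):
    """Pick a few short clips from a video, returned as lists of frame indices.

    Hypothetical helper for illustration only; the sampling strategy
    (uniform random start positions) is an assumption, not taken from the paper.
    """
    if frames_per_clip > num_frames:
        raise ValueError("video is shorter than one clip")
    rng = random.Random(seed)
    # Latest start index that still leaves room for a full clip.
    last_start = num_frames - frames_per_clip
    # Draw one start position per clip; sort so clips come out in temporal order.
    starts = sorted(rng.randint(0, last_start) for _ in range(num_clips))
    # Each clip is a run of consecutive frame indices.
    return [list(range(start, start + frames_per_clip)) for start in starts]


# A 300-frame video contributes only 2 clips x 4 frames = 8 frames to this step.
clips = sample_sparse_clips(300, num_clips=2, frames_per_clip=4, seed=0)
```

Calling the function again with a different seed at each training step would yield different clips, so the model sees different parts of the video over the course of training while each step stays cheap.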
Related papers
- Prompt Switch: Efficient CLIP Adaptation For Text-video Retrieval (2023) 11.93
- Fine-tuned CLIP Models Are Efficient Video Learners (2022) 21.57
- Expectation-maximization Contrastive Learning For Compact Video-and-language Representations (2022) 2.26
- Contrastive Video-language Learning With Fine-grained Frame Sampling (2022) 6.77
- Prompt-aware Of Frame Sampling For Efficient Text-video Retrieval (2025) 0.95
- CLIP2Video: Mastering Video-text Retrieval Via Image CLIP (2021) 0.00
- TeachCLIP: Multi-grained Teaching For Efficient Text-to-video Retrieval (2023) 0.00
- CenterCLIP: Token Clustering For Efficient Text-video Retrieval (2022) 15.54