Distilling Vision-language Models On Millions Of Videos
2024 · Yue Zhao, Long Zhao, Xingyi Zhou, et al.
Abstract
The recent advance in vision-language models is largely attributed to the abundance of image-text data. We aim to replicate this success for video-language models, but there simply is not enough human-curated video-text data available. We thus resort to fine-tuning a video-language model from a strong image-language baseline with synthesized instructional data. The resulting video model by video-instruction-tuning (VIIT) is then used to auto-label millions of videos to generate high-quality captions. We show the adapted video-language model performs well on a wide range of video-language benchmarks. For instance, it surpasses the best prior result on open-ended NExT-QA by 2.8%. Besides, our model generates detailed descriptions for previously unseen videos, which provide better textual supervision than existing methods. Experiments show that a video-language dual-encoder model contrastively trained on these auto-generated captions is 3.8% better than the strongest baseline that also le
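The abstract mentions a video-language dual-encoder trained contrastively on auto-generated captions, but the text shown here does not spell out the objective. As a rough illustration only, the sketch below assumes a CLIP-style symmetric InfoNCE loss over a batch of paired video and caption embeddings. The function name, the `temperature` value, and the use of NumPy are assumptions made for this example, not details taken from the paper.

```python
import numpy as np

def symmetric_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Hypothetical CLIP-style symmetric InfoNCE loss (assumed, not from the paper).

    video_emb, text_emb: arrays of shape (batch, dim); row i of each is a matched pair.
    """
    # L2-normalize so that dot products are cosine similarities.
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # Pairwise similarity matrix, scaled by the temperature.
    logits = (v @ t.T) / temperature

    def diagonal_cross_entropy(lg):
        # Numerically stable log-softmax over each row.
        lg = lg - lg.max(axis=1, keepdims=True)
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        # The correct "class" for row i is column i (the matched pair).
        return -np.mean(np.diag(log_probs))

    # Average the video-to-text and text-to-video directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    videos = rng.normal(size=(8, 16))
    captions = videos + 0.1 * rng.normal(size=(8, 16))  # noisy matched captions
    print(symmetric_contrastive_loss(videos, captions))
```

Under this assumed objective, each video is pulled toward its own caption and pushed away from the other captions in the batch, so the quality of the caption text directly shapes the learned embedding space.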
Related papers
- Narrating The Video: Boosting Text-video Retrieval Via Comprehensive Utilization Of Frame-level Captions (2025) 6.77
- Panda-70m: Captioning 70M Videos With Multiple Cross-modality Teachers (2024) 15.54
- Towards Holistic Language-video Representation: The Language Model-enhanced Msr-video To Text Dataset (2024) 0.00
- Litevl: Efficient Video-language Learning With Enhanced Spatial-temporal Modeling (2022) 6.34
- Learning Audio-video Modalities From Image Captions (2022) 12.54
- Clip-vip: Adapting Pre-trained Image-text Model To Video-language Representation Alignment (2022) 5.42
- Vidvec: Unlocking Video MLLM Embeddings For Video-text Retrieval (2026) 0.00
- DREAM: Improving Video-text Retrieval Through Relevance-based Augmentation Using Large Foundation Models (2024) 2.26