Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers
2024 · Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, et al.
Abstract
The quality of the data and annotation upper-bounds the quality of a downstream model. While there exist large text corpora and image-text pairs, high-quality video-text data is much harder to collect. First of all, manual labeling is more time-consuming, as it requires an annotator to watch an entire video. Second, videos have a temporal dimension, consisting of several scenes stacked together, and showing multiple actions. Accordingly, to establish a video dataset with high-quality captions, we propose an automatic approach leveraging multimodal inputs, such as textual video description, subtitles, and individual video frames. Specifically, we curate 3.8M high-resolution videos from the publicly available HD-VILA-100M dataset. We then split them into semantically consistent video clips, and apply multiple cross-modality teacher models to obtain captions for each video. Next, we finetune a retrieval model on a small subset where the best caption of each video is manually selected and