HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips
2019 · Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, et al.
Abstract
Learning text-video embeddings usually requires a dataset of video clips with manually provided captions. However, such datasets are expensive and time consuming to create and therefore difficult to obtain on a large scale. In this work, we propose instead to learn such embeddings from video data with readily available natural language annotations in the form of automatically transcribed narrations. The contributions of this work are three-fold. First, we introduce HowTo100M: a large-scale dataset of 136 million video clips sourced from 1.22M narrated instructional web videos depicting humans performing and describing over 23k different visual tasks. Our data collection procedure is fast, scalable and does not require any additional manual annotation. Second, we demonstrate that a text-video embedding trained on this data leads to state-of-the-art results for text-to-video retrieval and action localization on instructional video datasets such as YouCook2 or CrossTask. Finally, we show
Related papers
- Learning Audio-Video Modalities from Image Captions (2022) 12.54
- Learning a Text-Video Embedding from Incomplete and Heterogeneous Data (2018) 4.18
- Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers (2024) 15.54
- Distilling Vision-Language Models on Millions of Videos (2024) 7.50
- AVLnet: Learning Audio-Visual Language Representations from Instructional Videos (2020) 12.87
- Hierarchical Video-Moment Retrieval and Step-Captioning (2023) 12.54
- Towards Holistic Language-Video Representation: The Language Model-Enhanced MSR-Video to Text Dataset (2024) 0.00
- Text Is MASS: Modeling as Stochastic Embedding for Text-Video Retrieval (2024) 13.11