Watch And Learn: Mapping Language And Noisy Real-world Videos With Self-supervision
2020 · Yujie Zhong, Linhai Xie, Sen Wang, et al.
Abstract
In this paper, we teach machines to understand visuals and natural language by learning the mapping between sentences and noisy video snippets without explicit annotations. First, we define a self-supervised learning framework that captures cross-modal information. A novel adversarial learning module is then introduced to explicitly handle the noise in natural videos, where the subtitle sentences are not guaranteed to correspond closely to the video snippets. For training and evaluation, we contribute a new dataset, 'ApartmenTour', that contains a large number of online videos and subtitles. We carry out experiments on bidirectional retrieval tasks between sentences and videos, and the results demonstrate that our proposed model achieves state-of-the-art performance on both retrieval tasks and exceeds several strong baselines. The dataset can be downloaded at https://github.com/zyj-13/WAL.
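The bidirectional retrieval setting mentioned in the abstract can be illustrated with a small sketch. The abstract does not state which metric is used, so Recall@K with cosine similarity is assumed here as a common choice for retrieval; the embeddings are toy values invented for illustration, and `cosine` and `recall_at_k` are hypothetical helper names, not code from the WAL repository.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def recall_at_k(queries, gallery, k):
    """Fraction of queries whose paired item (same index) ranks in the top k."""
    hits = 0
    for i, q in enumerate(queries):
        scores = [cosine(q, g) for g in gallery]
        # Rank gallery indices from most to least similar to the query.
        ranked = sorted(range(len(gallery)), key=lambda j: scores[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(queries)

# Toy embeddings (assumed values): sentence i is paired with video i.
sentence_emb = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# The third video is a "noisy" pair: it looks more like sentence 0 than
# like its own subtitle, mimicking weak subtitle-video correspondence.
video_emb = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.7, 0.1, 0.2]]

s2v_r1 = recall_at_k(sentence_emb, video_emb, k=1)  # sentence -> video
v2s_r1 = recall_at_k(video_emb, sentence_emb, k=1)  # video -> sentence
v2s_r2 = recall_at_k(video_emb, sentence_emb, k=2)
print(s2v_r1, v2s_r1, v2s_r2)
```

In this toy case the noisy third video is retrieved correctly from its sentence but fails to retrieve that sentence at rank 1, which shows why the two retrieval directions are scored separately.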
Code
- zyj-13/WAL
Related papers
- AVLnet: Learning Audio-visual Language Representations From Instructional Videos (2020)
- Learning Language-visual Embedding For Movie Understanding With Natural-language (2016)
- Multimodal Clustering Networks For Self-supervised Learning From Unlabeled Videos (2021)
- Separating The "chirp" From The "chat": Self-supervised Visual Grounding Of Sound And Language (2024)
- Towards Holistic Language-video Representation: The Language Model-enhanced MSR-Video To Text Dataset (2024)
- Webly Supervised Joint Embedding For Cross-modal Image-text Retrieval (2018)
- Improving Spatiotemporal Self-supervision By Deep Reinforcement Learning (2018)
- Sovabench: A Vehicle Surveillance Action Retrieval Benchmark For Multimodal Large Language Models (2026)