See, Hear, And Read: Deep Aligned Representations
2017 · Yusuf Aytar, Carl Vondrick, Antonio Torralba
Abstract
We capitalize on large amounts of readily-available, synchronous data to learn deep discriminative representations shared across three major natural modalities: vision, sound and language. By leveraging over a year of sound from video and millions of sentences paired with images, we jointly train a deep convolutional network for aligned representation learning. Our experiments suggest that this representation is useful for several tasks, such as cross-modal retrieval or transferring classifiers between modalities. Moreover, although our network is only trained with image+text and image+sound pairs, it can transfer between text and sound as well, a transfer the network never observed during training. Visualizations of our representation reveal many hidden units which automatically emerge to detect concepts, independent of the modality.
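As a rough illustration of what cross-modal retrieval in a shared representation space can look like, the sketch below ranks items from one modality against a query from another. It is not the authors' implementation: the use of cosine similarity, the function names `cosine` and `retrieve`, the file names, and the three-dimensional vectors are all assumptions made for this example. In practice the vectors would come from the trained network's aligned layers.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve(query, gallery, k=1):
    """Return the k gallery keys whose embeddings are most similar to the query."""
    ranked = sorted(gallery, key=lambda name: cosine(query, gallery[name]), reverse=True)
    return ranked[:k]

# Hypothetical, hand-written embeddings standing in for network outputs.
sound_embeddings = {
    "dog_bark.wav": [0.9, 0.1, 0.0],
    "ocean_waves.wav": [0.0, 0.2, 0.9],
    "car_engine.wav": [0.1, 0.9, 0.1],
}

# Hypothetical embedding of a sentence about a dog (text modality).
text_query = [0.8, 0.2, 0.1]

print(retrieve(text_query, sound_embeddings, k=2))
```

Because both modalities are assumed to live in the same space, the same `retrieve` call would work for any query and gallery pairing, including text-to-sound, the pairing the abstract says was never observed during training.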
Related papers
- Jointly Discovering Visual Objects And Spoken Words From Raw Sensory Input (2018)
- Learning Aligned Cross-modal Representations From Weakly Aligned Data (2016)
- Deep Latent Space Learning For Cross-modal Mapping Of Audio And Visual Signals (2019)
- Objects That Sound (2017)
- AVLnet: Learning Audio-visual Language Representations From Instructional Videos (2020)
- Cross-modal Discrete Representation Learning (2021)
- Speech-image Semantic Alignment Does Not Depend On Any Prior Classification Tasks (2020)
- Learning From Multiview Correlations In Open-domain Videos (2018)