Large-scale Representation Learning From Visually Grounded Untranscribed Speech
2019 · Gabriel Ilharco, Yuan Zhang, Jason Baldridge
Abstract
Systems that can associate images with their spoken audio captions are an important step towards visually grounded language learning. We describe a scalable method to automatically generate diverse audio for image captioning datasets. This supports pretraining deep networks for encoding both audio and images, which we do via a dual encoder that learns to align latent representations from both modalities. We show that a masked margin softmax loss for such models is superior to the standard triplet loss. We fine-tune these models on the Flickr8k Audio Captions Corpus and obtain state-of-the-art results, improving recall in the top 10 from 29.6% to 49.5%. We also obtain human ratings on retrieval outputs to better assess the impact of incidentally matching image-caption pairs that were not associated in the data, finding that automatic evaluation substantially underestimates the quality of the retrieved results.
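As a rough illustration of the quantities the abstract refers to, the sketch below computes recall in the top k and an in-batch softmax loss with a margin from a similarity matrix between audio and image embeddings. It is not the authors' implementation: the function names, the `margin` and `mask` parameters, and the convention that the true pair for row i sits on the diagonal are all assumptions made for this example, and the exact form of the paper's masked margin softmax loss may differ.

```python
import numpy as np


def recall_at_k(sim, k):
    """Fraction of queries whose true match (the diagonal entry) ranks in the top k.

    sim[i, j] is the similarity between query i and candidate j (assumed layout).
    """
    sim = np.asarray(sim, dtype=float)
    true_scores = np.diag(sim)[:, None]
    # Rank of the true item = number of candidates scoring strictly higher.
    ranks = (sim > true_scores).sum(axis=1)
    return float((ranks < k).mean())


def margin_softmax_loss(sim, margin=0.0, mask=None):
    """In-batch softmax loss with a margin on positives, averaged over both directions.

    mask[i, j] = True marks an off-diagonal pair to exclude from the negatives
    (hypothetical convention, e.g. for pairs known to match incidentally).
    """
    sim = np.asarray(sim, dtype=float)
    n = sim.shape[0]
    # Subtracting the margin from the positives makes the task harder.
    logits = sim - margin * np.eye(n)
    if mask is not None:
        # Never mask the diagonal: the positive pair must stay in the softmax.
        mask = np.asarray(mask, dtype=bool) & ~np.eye(n, dtype=bool)
        logits = np.where(mask, -np.inf, logits)

    def cross_entropy(l):
        # Numerically stable log-sum-exp over each row.
        m = l.max(axis=1, keepdims=True)
        lse = m[:, 0] + np.log(np.exp(l - m).sum(axis=1))
        return (lse - np.diag(l)).mean()

    # Audio-to-image uses rows; image-to-audio uses the transposed matrix.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))


# Toy similarity matrix for three audio-image pairs (made-up numbers).
sim = np.array([[0.9, 0.1, 0.2],
                [0.8, 0.3, 0.1],
                [0.2, 0.1, 0.7]])
print(recall_at_k(sim, 1), margin_softmax_loss(sim, margin=0.5))
```

A triplet loss compares each positive against a single sampled negative, whereas a softmax formulation of this kind scores the positive against every other item in the batch at once.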
Related papers
- Leveraging Pretrained Image-text Models For Improving Audio-visual Learning (2023) · 0.00
- Jointly Discovering Visual Objects And Spoken Words From Raw Sensory Input (2018) · 14.27
- Language Learning Using Speech To Image Retrieval (2019) · 9.41
- Symbolic Inductive Bias For Visually Grounded Learning Of Spoken Language (2018) · 5.24
- Learning Audio-video Modalities From Image Captions (2022) · 12.54
- Semantic Speech Retrieval With A Visually Grounded Model Of Untranscribed Speech (2017) · 10.61
- Sound-VECaps: Improving Audio Generation With Visual Enhanced Captions (2024) · 7.16
- Text-free Image-to-speech Synthesis Using Learned Segmental Units (2020) · 10.85