On The Contributions Of Visual And Textual Supervision In Low-resource Semantic Speech Retrieval
2019 · Ankita Pasad, Bowen Shi, Herman Kamper, et al.
Abstract
Recent work has shown that speech paired with images can be used to learn semantically meaningful speech representations even without any textual supervision. In real-world low-resource settings, however, we often have access to some transcribed speech. We study whether and how visual grounding is useful in the presence of varying amounts of textual supervision. In particular, we consider the task of semantic speech retrieval in a low-resource setting. We use a previously studied data set and task, where models are trained on images with spoken captions and evaluated on human judgments of semantic relevance. We propose a multitask learning approach to leverage both visual and textual modalities, with visual supervision in the form of keyword probabilities from an external tagger. We find that visual grounding is helpful even in the presence of textual supervision, and we analyze this effect over a range of sizes of transcribed data sets. With ~5 hours of transcribed speech, we obtain 2
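The multitask idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a shared speech encoder feeding two keyword-prediction heads: a visual head trained against soft keyword probabilities from an external image tagger, and a textual head trained against bag-of-words labels derived from transcripts. The function names, the weight `alpha`, and the use of sigmoid cross-entropy for both heads are hypothetical choices made for illustration.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def binary_cross_entropy(logits, targets, eps=1e-12):
    """Mean sigmoid cross-entropy over keywords.

    Targets may be hard labels (0 or 1) or soft probabilities in [0, 1].
    """
    total = 0.0
    for z, t in zip(logits, targets):
        p = min(max(sigmoid(z), eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(logits)


def multitask_loss(visual_logits, tagger_probs, text_logits, bow_labels, alpha=0.5):
    """Weighted sum of the visual and textual losses (alpha is an assumed hyperparameter)."""
    # Visual branch: soft targets from the external tagger (assumption).
    visual = binary_cross_entropy(visual_logits, tagger_probs)
    # Textual branch: bag-of-words labels from transcripts (assumption).
    textual = binary_cross_entropy(text_logits, bow_labels)
    return alpha * visual + (1.0 - alpha) * textual


# Example with a two-keyword vocabulary.
loss = multitask_loss(
    visual_logits=[2.0, -1.0],
    tagger_probs=[0.9, 0.1],
    text_logits=[1.5, -2.0],
    bow_labels=[1.0, 0.0],
)
```

In this sketch, utterances without a transcript could contribute only the visual term, which is one way to vary the amount of textual supervision while keeping the visual signal fixed.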
Related papers
- Semantic Speech Retrieval With A Visually Grounded Model Of Untranscribed Speech (2017)
- Towards Localisation Of Keywords In Speech Using Weak Supervision (2020)
- Learning Speech Representations From Raw Audio By Joint Audiovisual Self-supervision (2020)
- Hindi As A Second Language: Improving Visually Grounded Speech With Semantically Similar Samples (2023)
- Symbolic Inductive Bias For Visually Grounded Learning Of Spoken Language (2018)
- Semantic Query-by-example Speech Search Using Visual Grounding (2019)
- Visually Grounded Speech Models For Low-resource Languages And Cognitive Modelling (2024)
- Fine-grained Grounding For Multimodal Speech Recognition (2020)