Keyword Localisation In Untranscribed Speech Using Visually Grounded Speech Models
2022 · Kayode Olaleye, Dan Oneata, Herman Kamper
Abstract
Keyword localisation is the task of finding where in a speech utterance a given query keyword occurs. We investigate to what extent keyword localisation is possible using a visually grounded speech (VGS) model. VGS models are trained on unlabelled images paired with spoken captions. These models are therefore self-supervised -- trained without any explicit textual label or location information. To obtain training targets, we first tag training images with soft text labels using a pretrained visual classifier with a fixed vocabulary. This enables a VGS model to predict the presence of a written keyword in an utterance, but not its location. We consider four ways to equip VGS models with localisation capabilities. Two of these -- a saliency approach and input masking -- can be applied to an arbitrary prediction model after training, while the other two -- attention and a score aggregation approach -- are incorporated directly into the structure of the model. Masked-based localisation gi
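The input-masking idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how masking is performed, so the sliding window, the zero mask value, and the names `localise_by_masking` and `toy_detector` are all assumptions made for illustration. The toy detector merely stands in for a trained VGS model that outputs a keyword-presence score for an utterance.

```python
from typing import Callable, List, Sequence


def localise_by_masking(
    frames: Sequence[float],
    detect: Callable[[Sequence[float]], float],
    window: int = 3,
    mask_value: float = 0.0,
) -> int:
    """Return the start index of the window whose masking lowers the
    detection score the most (hypothetical helper, not from the paper)."""
    if not 1 <= window <= len(frames):
        raise ValueError("window must be between 1 and len(frames)")
    base = detect(frames)  # score on the unmasked utterance
    best_start, best_drop = 0, float("-inf")
    for start in range(len(frames) - window + 1):
        masked: List[float] = list(frames)
        masked[start:start + window] = [mask_value] * window
        drop = base - detect(masked)  # how much the score falls
        if drop > best_drop:
            best_start, best_drop = start, drop
    return best_start


def toy_detector(frames: Sequence[float]) -> float:
    """Assumed stand-in for a trained keyword detector: the mean of the
    three largest per-frame activations."""
    top = sorted(frames, reverse=True)[:3]
    return sum(top) / len(top)


# One scalar "activation" per frame; the keyword sits around frames 3 to 5.
utterance = [0.1, 0.0, 0.2, 0.9, 0.8, 0.7, 0.1, 0.0]
start = localise_by_masking(utterance, toy_detector, window=3)
print(start)  # 3
```

Because the procedure only queries the detector's output score, it can in principle be applied to any prediction model after training, which matches the abstract's description of masking as a post-hoc approach.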
Related papers
- Towards Localisation Of Keywords In Speech Using Weak Supervision (2020)
- Visually Grounded Keyword Detection And Localisation For Low-resource Languages (2023)
- Visually Grounded Speech Models For Low-resource Languages And Cognitive Modelling (2024)
- Semantic Speech Retrieval With A Visually Grounded Model Of Untranscribed Speech (2017)
- I See What You Hear: A Vision-inspired Method To Localize Words (2022)
- YFACC: A Yorùbá Speech-image Dataset For Cross-lingual Keyword Localisation Through Visual Grounding (2022)
- Visual Keyword Spotting With Attention (2021)
- Semantic Query-by-example Speech Search Using Visual Grounding (2019)