Towards Localisation Of Keywords In Speech Using Weak Supervision
2020 · Kayode Olaleye, Benjamin van Niekerk, Herman Kamper
Abstract
Developments in weakly supervised and self-supervised models could enable speech technology in low-resource settings where full transcriptions are not available. We consider whether keyword localisation is possible using two forms of weak supervision where location information is not provided explicitly. In the first, only the presence or absence of a word is indicated, i.e. a bag-of-words (BoW) labelling. In the second, visual context is provided in the form of an image paired with an unlabelled utterance; a model then needs to be trained in a self-supervised fashion using the paired data. For keyword localisation, we adapt a saliency-based method typically used in the vision domain. We compare this to an existing technique that performs localisation as a part of the network architecture. While the saliency-based method is more flexible (it can be applied without architectural restrictions), we identify a critical limitation when using it for keyword localisation. Of the two forms of
Related papers
- Keyword Localisation In Untranscribed Speech Using Visually Grounded Speech Models (2022)
- Visually Grounded Keyword Detection And Localisation For Low-resource Languages (2023)
- On The Contributions Of Visual And Textual Supervision In Low-resource Semantic Speech Retrieval (2019)
- Exploring Representation Learning For Small-footprint Keyword Spotting (2023)
- Contrastive Augmentation: An Unsupervised Learning Approach For Keyword Spotting In Speech Technology (2024)
- Small-footprint Open-vocabulary Keyword Spotting With Quantized LSTM Networks (2020)
- I See What You Hear: A Vision-inspired Method To Localize Words (2022)
- Language-universal Speech Attributes Modeling For Zero-shot Multilingual Spoken Keyword Recognition (2024)