Visually Grounded Keyword Detection And Localisation For Low-resource Languages
2023 · Kayode Kolawole Olaleye
Abstract
This study investigates the use of Visually Grounded Speech (VGS) models for keyword localisation in speech. It focusses on two main research questions: (1) Is keyword localisation possible with VGS models? (2) Can keyword localisation be done cross-lingually in a real low-resource setting? Four localisation methods are proposed and evaluated on an English dataset, with the best-performing method achieving an accuracy of 57%. A new dataset of spoken captions in the Yoruba language is also collected and released for cross-lingual keyword localisation. The cross-lingual model obtains a precision of 16% in actual keyword localisation, and this performance can be improved by initialising from a model pretrained on English data. The study presents a detailed analysis of the model's success and failure modes and highlights the challenges of using VGS models for keyword localisation in low-resource settings.
Related papers
- Visually Grounded Speech Models For Low-resource Languages And Cognitive Modelling (2024)
- YFACC: A Yorùbá Speech-image Dataset For Cross-lingual Keyword Localisation Through Visual Grounding (2022)
- Keyword Localisation In Untranscribed Speech Using Visually Grounded Speech Models (2022)
- Towards Localisation Of Keywords In Speech Using Weak Supervision (2020)
- Hindi As A Second Language: Improving Visually Grounded Speech With Semantically Similar Samples (2023)
- On The Contributions Of Visual And Textual Supervision In Low-resource Semantic Speech Retrieval (2019)
- Exploring Representation Learning For Small-footprint Keyword Spotting (2023)
- Semantic Speech Retrieval With A Visually Grounded Model Of Untranscribed Speech (2017)