Word Discovery In Visually Grounded, Self-supervised Speech Models
2022 · Puyuan Peng, David Harwath
Abstract
We present a method for visually-grounded spoken term discovery. After training either a HuBERT or wav2vec2.0 model to associate spoken captions with natural images, we show that powerful word segmentation and clustering capability emerges within the model's self-attention heads. Our experiments reveal that this ability is not present to nearly the same extent in the base HuBERT and wav2vec2.0 models, suggesting that the visual grounding task is a crucial component of the word discovery capability we observe. We also evaluate our method on the Buckeye word segmentation and ZeroSpeech spoken term discovery tasks, where we perform on par with or better than currently published methods on several metrics. Code and model weights are available at https://github.com/jasonppy/word-discovery.
Related papers
- Syllable Discovery And Cross-lingual Generalization In A Visually Grounded, Self-supervised Speech Model (2023)
- Integrating Self-supervised Speech Model With Pseudo Word-level Targets From Visually-grounded Speech Model (2024)
- Towards Visually Grounded Sub-word Speech Unit Discovery (2019)
- Word Recognition, Competition, And Activation In A Model Of Visually Grounded Speech (2019)
- Semantic Speech Retrieval With A Visually Grounded Model Of Untranscribed Speech (2017)
- Jointly Discovering Visual Objects And Spoken Words From Raw Sensory Input (2018)
- Learning Word-like Units From Joint Audio-visual Analysis (2017)
- Learning Hierarchical Discrete Linguistic Units From Visually-grounded Speech (2019)