Towards Visually Grounded Sub-word Speech Unit Discovery
2019 · David Harwath, James Glass
Abstract
In this paper, we investigate the manner in which interpretable sub-word speech units emerge within a convolutional neural network model trained to associate raw speech waveforms with semantically related natural image scenes. We show how diphone boundaries can be superficially extracted from the activation patterns of intermediate layers of the model, suggesting that the model may be leveraging these events for the purpose of word recognition. We present a series of experiments investigating the information encoded by these events.
Related papers
- Learning Hierarchical Discrete Linguistic Units From Visually-grounded Speech (2019)
- Word Recognition, Competition, And Activation In A Model Of Visually Grounded Speech (2019)
- Word Discovery In Visually Grounded, Self-supervised Speech Models (2022)
- Learning Word-like Units From Joint Audio-visual Analysis (2017)
- Catplayinginthesnow: Impact Of Prior Segmentation On A Model Of Visually Grounded Speech (2020)
- Representations Of Language In A Model Of Visually Grounded Speech Signal (2017)
- Jointly Discovering Visual Objects And Spoken Words From Raw Sensory Input (2018)
- Syllable Discovery And Cross-lingual Generalization In A Visually Grounded, Self-supervised Speech Model (2023)