Learning Hierarchical Discrete Linguistic Units From Visually-grounded Speech
2019 · David Harwath, Wei-Ning Hsu, James Glass
Abstract
In this paper, we present a method for learning discrete linguistic units by incorporating vector quantization layers into neural models of visually grounded speech. We show that our method is capable of capturing both word-level and sub-word units, depending on how it is configured. What differentiates this paper from prior work on speech unit learning is the choice of training objective. Rather than using a reconstruction-based loss, we use a discriminative, multimodal grounding objective which forces the learned units to be useful for semantic image retrieval. We evaluate the sub-word units on the ZeroSpeech 2019 challenge, achieving a 27.3% reduction in ABX error rate over the top-performing submission, while keeping the bitrate approximately the same. We also present experiments demonstrating the noise robustness of these units. Finally, we show that a model with multiple quantizers can simultaneously learn phone-like detectors at a lower layer and word-like detectors at a higher
Related papers
- Vector-quantized Neural Networks For Acoustic Unit Discovery In The ZeroSpeech 2020 Challenge (2020) 13.50
- Towards Visually Grounded Sub-word Speech Unit Discovery (2019) 9.03
- Unsupervised Acoustic Unit Discovery For Speech Synthesis Using Discrete Latent-variable Neural Networks (2019) 9.59
- Learning Word-like Units From Joint Audio-visual Analysis (2017) 12.33
- Unsupervised End-to-end Learning Of Discrete Linguistic Units For Voice Conversion (2019) 9.03
- CatPlayingInTheSnow: Impact Of Prior Segmentation On A Model Of Visually Grounded Speech (2020) 4.52
- Combining Adversarial Training And Disentangled Speech Representation For Robust Zero-resource Subword Modeling (2019) 7.16
- Language Learning Using Speech To Image Retrieval (2019) 9.41