Learning Hierarchical Discrete Linguistic Units From Visually-grounded Speech
2019 · David Harwath, Wei-Ning Hsu, James Glass
Abstract
In this paper, we present a method for learning discrete linguistic units by incorporating vector quantization layers into neural models of visually grounded speech. We show that our method is capable of capturing both word-level and sub-word units, depending on how it is configured. What differentiates this paper from prior work on speech unit learning is the choice of training objective. Rather than using a reconstruction-based loss, we use a discriminative, multimodal grounding objective which forces the learned units to be useful for semantic image retrieval. We evaluate the sub-word units on the ZeroSpeech 2019 challenge, achieving a 27.3% reduction in ABX error rate over the top-performing submission, while keeping the bitrate approximately the same. We also present experiments demonstrating the noise robustness of these units. Finally, we show that a model with multiple quantizers can simultaneously learn phone-like detectors at a lower layer and word-like detectors at a higher layer.
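The core operation of a vector quantization layer, as mentioned in the abstract, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `quantize`, the toy two-dimensional codebook, and the squared Euclidean distance are all assumptions chosen for illustration, and the training of the codebook (as well as the grounding objective itself) is omitted.

```python
from typing import List, Sequence, Tuple


def quantize(
    frames: Sequence[Sequence[float]],
    codebook: Sequence[Sequence[float]],
) -> Tuple[List[int], List[List[float]]]:
    """Replace each frame vector with its nearest codebook entry.

    Hypothetical helper: nearest neighbour under squared Euclidean
    distance, which is one common choice for vector quantization.
    """
    indices: List[int] = []
    quantized: List[List[float]] = []
    for frame in frames:
        # Index of the codebook vector closest to this frame.
        best = min(
            range(len(codebook)),
            key=lambda k: sum((f - c) ** 2 for f, c in zip(frame, codebook[k])),
        )
        indices.append(best)
        quantized.append(list(codebook[best]))
    return indices, quantized


# Toy data (made up for illustration): three codes, four speech-frame features.
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
frames = [[0.1, -0.2], [0.9, 1.2], [3.5, 0.4], [0.8, 0.7]]

indices, quantized = quantize(frames, codebook)
print(indices)  # discrete unit sequence: [0, 1, 2, 1]
```

The resulting index sequence is the discrete representation; in a trained model, each index would ideally correspond to a phone-like or word-like unit, depending on the layer at which the quantizer is placed.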
Related papers
- Separating The "chirp" From The "chat": Self-supervised Visual Grounding Of Sound And Language (2024)
- Jointly Discovering Visual Objects And Spoken Words From Raw Sensory Input (2018)
- E-ViLM: Efficient Video-language Model Via Masked Video Modeling With Semantic Vector-quantized Tokenizer (2023)
- Cross-modal Discrete Representation Learning (2021)
- Hindi As A Second Language: Improving Visually Grounded Speech With Semantically Similar Samples (2023)
- VQToken: Neural Discrete Token Representation Learning For Extreme Token Reduction In Video Large Language Models (2025)
- Blind To Position, Biased In Language: Probing Mid-layer Representational Bias In Vision-language Encoders For Zero-shot Language-grounded Spatial Understanding (2025)
- AVLnet: Learning Audio-visual Language Representations From Instructional Videos (2020)