Syllable Discovery And Cross-lingual Generalization In A Visually Grounded, Self-supervised Speech Model
2023 · Puyuan Peng, Shang-Wen Li, Okko Räsänen, et al.
Abstract
In this paper, we show that representations capturing syllabic units emerge when training a self-supervised speech model with a visually-grounded training objective. We demonstrate that a nearly identical model architecture (HuBERT) trained with a masked language modeling loss does not exhibit this same ability, suggesting that the visual grounding objective is responsible for the emergence of this phenomenon. We propose the use of a minimum cut algorithm to automatically predict syllable boundaries in speech, followed by a 2-stage clustering method to group identical syllables together. We show that our model not only outperforms a state-of-the-art syllabic segmentation method on the language it was trained on (English), but also generalizes in a zero-shot fashion to Estonian. Finally, we show that the same model is capable of zero-shot generalization for a word segmentation task on 4 other languages from the Zerospeech Challenge, in some cases beating the previous state-of-the-art.
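The abstract names a minimum cut algorithm for boundary prediction but gives no details, so the following is only a minimal sketch of one way such a segmenter could look, not the authors' implementation. It assumes frame-level features are already available as a NumPy array, uses a cosine self-similarity matrix, and picks the contiguous partition with the lowest summed normalized-cut cost by dynamic programming. The function name `min_cut_segment`, the `num_segments` parameter, and the choice of cost are all assumptions made for illustration.

```python
import numpy as np


def min_cut_segment(features, num_segments):
    """Split T frames into contiguous segments with minimal summed normalized cut.

    Hypothetical helper: `features` is a (T, D) array of frame embeddings and
    `num_segments` is the number of segments to return (an assumed input).
    Returns a list of num_segments + 1 frame indices, from 0 to T.
    """
    T = features.shape[0]
    # Cosine self-similarity between frames, clipped so edge weights are >= 0.
    unit = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-8)
    sim = np.clip(unit @ unit.T, 0.0, None)

    # 2-D prefix sums so that any block sum of `sim` costs O(1).
    prefix = np.zeros((T + 1, T + 1))
    prefix[1:, 1:] = sim.cumsum(axis=0).cumsum(axis=1)

    def block(a, b, c, d):
        # Sum of sim[a:b, c:d].
        return prefix[b, d] - prefix[a, d] - prefix[b, c] + prefix[a, c]

    def ncut(i, j):
        # Similarity leaving segment [i, j) divided by its total similarity.
        within = block(i, j, i, j)
        volume = block(i, j, 0, T)
        return (volume - within) / (volume + 1e-8)

    # best[k, j]: lowest cost of covering frames [0, j) with k segments.
    best = np.full((num_segments + 1, T + 1), np.inf)
    back = np.zeros((num_segments + 1, T + 1), dtype=int)
    best[0, 0] = 0.0
    for k in range(1, num_segments + 1):
        for j in range(k, T + 1):
            for i in range(k - 1, j):
                cost = best[k - 1, i] + ncut(i, j)
                if cost < best[k, j]:
                    best[k, j] = cost
                    back[k, j] = i

    # Follow the back-pointers from the last frame to recover the boundaries.
    boundaries = []
    j = T
    for k in range(num_segments, 0, -1):
        boundaries.append(int(j))
        j = int(back[k, j])
    boundaries.append(0)
    return boundaries[::-1]
```

On toy features built from three runs of mutually orthogonal vectors, this sketch recovers the run boundaries. In a pipeline like the one the abstract describes, the features inside each predicted segment could then be pooled and passed to a clustering step; that part is not shown here.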
Related papers
- Word Discovery In Visually Grounded, Self-supervised Speech Models (2022) · 14.08
- SD-HuBERT: Sentence-level Self-distillation Induces Syllabic Organization In HuBERT (2023) · 5.24
- HuBERT: Self-supervised Speech Representation Learning By Masked Prediction Of Hidden Units (2021) · 25.30
- CatPlayingInTheSnow: Impact Of Prior Segmentation On A Model Of Visually Grounded Speech (2020) · 4.52
- Pushing The Limits Of Unsupervised Unit Discovery For SSL Speech Representation (2023) · 6.34
- Self-supervised Contrastive Learning For Unsupervised Phoneme Segmentation (2020) · 12.68
- Unsupervised Accent Adaptation Through Masked Language Model Correction Of Discrete Self-supervised Speech Units (2023) · 4.52
- Self-supervised Representation Learning For Speech Using Visual Grounding And Masked Language Modeling (2022) · 0.00