Pushing The Limits Of Unsupervised Unit Discovery For SSL Speech Representation
2023 · Ziyang Ma, Zhisheng Zheng, Guanrou Yang, et al.
Abstract
The excellent generalization ability of self-supervised learning (SSL) for speech foundation models has garnered significant attention. HuBERT is a successful example that utilizes offline clustering to convert speech features into discrete units for a masked language modeling pretext task. However, simply clustering features with k-means to obtain targets does not fully exploit the model's potential. In this work, we present an unsupervised method to improve SSL targets. Two models are proposed, MonoBERT and PolyBERT, which leverage context-independent and context-dependent phoneme-based units for pre-training. Our models outperform other SSL models significantly on the LibriSpeech benchmark without the need for iterative re-clustering and re-training. Furthermore, our models equipped with context-dependent units even outperform target-improvement models that use labeled data during pre-training. Experiments demonstrate how we progressively improve the unit discovery process.
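The offline clustering step that the abstract attributes to HuBERT can be illustrated with a minimal sketch. This is plain k-means written in NumPy, not the MonoBERT or PolyBERT unit discovery method. The function name `kmeans_units`, the random stand-in features, and the choice of 8 units are assumptions made only for illustration.

```python
import numpy as np

def kmeans_units(features, num_units, num_iters=20, seed=0):
    """Cluster frame-level features; return (centroids, unit_ids).

    Hypothetical helper: generic k-means, not the paper's method.
    """
    rng = np.random.default_rng(seed)
    # Initialise centroids from randomly chosen frames.
    init = rng.choice(len(features), size=num_units, replace=False)
    centroids = features[init].copy()
    for _ in range(num_iters):
        # Squared Euclidean distance from every frame to every centroid.
        dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        unit_ids = dists.argmin(axis=1)
        for k in range(num_units):
            members = features[unit_ids == k]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[k] = members.mean(axis=0)
    # Final assignment against the updated centroids.
    dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return centroids, dists.argmin(axis=1)

# Stand-in for acoustic features (assumption): 200 frames, 13 dimensions each.
frames = np.random.default_rng(1).normal(size=(200, 13))
centroids, units = kmeans_units(frames, num_units=8)
```

Each entry of `units` is a discrete cluster index for one frame. In a masked prediction setup of the kind the abstract describes, such indices would serve as the targets the model predicts at masked positions.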
Related papers
- UniSpeech-SAT: Universal Speech Representation Learning With Speaker Aware Pre-training (2021)
- Multi-resolution HuBERT: Multi-resolution Speech Self-supervised Learning With Masked Unit Prediction (2023)
- Fast-HuBERT: An Efficient Training Framework For Self-supervised Speech Representation Learning (2023)
- HuBERT: Self-supervised Speech Representation Learning By Masked Prediction Of Hidden Units (2021)
- FitHuBERT: Going Thinner And Deeper For Knowledge Distillation Of Speech Self-supervised Learning (2022)
- SD-HuBERT: Sentence-level Self-distillation Induces Syllabic Organization In HuBERT (2023)
- Understanding Self-supervised Learning Of Speech Representation Via Invariance And Redundancy Reduction (2023)
- Self-supervised Learning For Speech Recognition With Intermediate Layer Supervision (2021)