DinoSR: Self-distillation And Online Clustering For Self-supervised Speech Representation Learning
2023 · Alexander H. Liu, Heng-Jui Chang, Michael Auli, et al.
Abstract
In this paper, we introduce self-distillation and online clustering for self-supervised speech representation learning (DinoSR) which combines masked language modeling, self-distillation, and online clustering. We show that these concepts complement each other and result in a strong representation learning model for speech. DinoSR first extracts contextualized embeddings from the input audio with a teacher network, then runs an online clustering system on the embeddings to yield a machine-discovered phone inventory, and finally uses the discretized tokens to guide a student network. We show that DinoSR surpasses previous state-of-the-art performance in several downstream tasks, and provide a detailed analysis of the model and the learned discrete units.
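The three steps named in the abstract (teacher embeddings, online clustering, discrete targets for the student) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The abstract does not say how the codebook or the teacher is updated, so the exponential-moving-average (EMA) updates, the decay values, the nearest-neighbour assignment, and every function name below are assumptions chosen for illustration.

```python
import numpy as np


def assign_clusters(embeddings, codebook):
    """Map each teacher frame (T, D) to its nearest codeword (K, D)."""
    # Squared Euclidean distance between every frame and every codeword.
    dists = ((embeddings[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)


def update_codebook(codebook_sum, counts, embeddings, assignments, decay=0.9):
    """Online clustering step (assumed EMA form, not taken from the abstract).

    codebook_sum holds a running sum of the embeddings assigned to each
    cluster and counts holds a running assignment count; the codeword
    itself is codebook_sum / counts[:, None].
    """
    num_clusters = codebook_sum.shape[0]
    one_hot = np.eye(num_clusters)[assignments]  # shape (T, K)
    new_sum = decay * codebook_sum + (1 - decay) * one_hot.T @ embeddings
    new_counts = decay * counts + (1 - decay) * one_hot.sum(axis=0)
    return new_sum, new_counts


def student_loss(student_logits, targets, mask):
    """Cross-entropy between student predictions and cluster indices,
    averaged over masked frames only (mask is 1 where the input was masked)."""
    shifted = student_logits - student_logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    picked = log_probs[np.arange(len(targets)), targets]
    return -(picked * mask).sum() / max(mask.sum(), 1)


def ema_update(teacher_params, student_params, decay=0.999):
    """Self-distillation step (assumed): the teacher tracks the student by EMA."""
    return decay * teacher_params + (1 - decay) * student_params
```

In a training loop built on this sketch, the teacher would embed the unmasked audio, `assign_clusters` would turn those embeddings into discrete targets, `student_loss` would score the student's predictions on the masked frames, and `update_codebook` and `ema_update` would then refresh the codebook and the teacher without gradient updates.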
Related papers
- Self-supervised Reflective Learning Through Self-distillation And Online Clustering For Speaker Representation Learning (2024)
- Self-supervised Learning With Cluster-aware-DINO For High-performance Robust Speaker Verification (2023)
- Deep Self-supervised Hierarchical Clustering For Speaker Diarization (2020)
- Pushing The Limits Of Self-supervised Speaker Verification Using Regularized Distillation Framework (2022)
- Self-distillation Prototypes Network: Learning Robust Speaker Representations Without Supervision (2023)
- DINO-VITS: Data-efficient Zero-shot TTS With Self-supervised Speaker Verification Loss For Noise Robustness (2023)
- Textless Acoustic Model With Self-supervised Distillation For Noise-robust Expressive Speech-to-speech Translation (2024)
- A Reinforcement Learning Framework For Online Speaker Diarization (2023)