Masked Modeling Duo For Speech: Specializing General-purpose Audio Representation To Speech Using Denoising Distillation
2023 · Daisuke Niizumi, Daiki Takeuchi, Yasunori Ohishi, et al.
Abstract
General-purpose audio representations learned with self-supervised learning have demonstrated high performance in a variety of tasks. Although they can be optimized for an application by fine-tuning, even higher performance can be expected if they are specialized for that application during pre-training. This paper explores the challenges and solutions in specializing general-purpose audio representations for a specific application, using speech, a highly demanding field, as an example. We enhance Masked Modeling Duo (M2D), a general-purpose model, to close the performance gap with state-of-the-art (SOTA) speech models. To do so, we propose a new task, denoising distillation, which learns from fine-grained clustered features, and M2D for Speech (M2D-S), which jointly learns the denoising distillation task and the M2D masked prediction task. Experimental results show that M2D-S performs comparably to or outperforms SOTA speech models on the SUPERB benchmark, demonstrating that M2D can specialize in a demanding field.
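The abstract says only that M2D-S learns the two tasks jointly; it does not give the loss functions or how they are combined. As a rough illustration of what a joint objective can look like, the sketch below adds a weighted distillation term to a masked prediction term. Everything in it is an assumption for illustration: the names `mse`, `joint_loss`, and `distill_weight` are hypothetical, mean squared error is a placeholder for both terms, and the weighted sum is not confirmed as the paper's formulation.

```python
from typing import Sequence


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equal-length vectors (placeholder loss)."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def joint_loss(
    masked_pred: Sequence[float],
    masked_target: Sequence[float],
    distill_pred: Sequence[float],
    distill_target: Sequence[float],
    distill_weight: float = 0.5,  # hypothetical weight, not taken from the paper
) -> float:
    """Combine two task losses into one training objective.

    Assumed form: total = masked prediction loss + distill_weight * distillation loss.
    """
    loss_masked = mse(masked_pred, masked_target)      # masked prediction term
    loss_distill = mse(distill_pred, distill_target)   # denoising distillation term
    return loss_masked + distill_weight * loss_distill
```

In such a setup, a weight of zero reduces training to the masked prediction task alone, so the weight controls how strongly the representation is pulled toward the speech-specific targets.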
Related papers
- M2D-CLAP: Masked Modeling Duo Meets CLAP For Learning General-purpose Audio-language Representation (2024)
- An Efficient End-to-end Approach To Noise Invariant Speech Features Via Multi-task Learning (2024)
- Single-stage TTS With Masked Audio Token Modeling And Semantic Knowledge Distillation (2024)
- SuperM2M: Supervised And Mixture-to-mixture Co-learning For Speech Enhancement And Noise-robust ASR (2024)
- Joint Semantic Knowledge Distillation And Masked Acoustic Modeling For Full-band Speech Restoration With Improved Intelligibility (2024)
- SADDEL: Joint Speech Separation And Denoising Model Based On Multitask Learning (2020)
- Self-supervised Learning Based Monaural Speech Enhancement With Multi-task Pre-training (2021)
- MaD TwinNet: Masker-denoiser Architecture With Twin Networks For Monaural Sound Source Separation (2018)