DMOSpeech: Direct Metric Optimization Via Distilled Diffusion Model In Zero-shot Speech Synthesis
2024 · Yinghao Aaron Li, Rithesh Kumar, Zeyu Jin
Abstract
Diffusion models have demonstrated significant potential in speech synthesis tasks, including text-to-speech (TTS) and voice cloning. However, their iterative denoising processes are computationally intensive, and previous distillation attempts have shown consistent quality degradation. Moreover, existing TTS approaches are limited by non-differentiable components or iterative sampling that prevent true end-to-end optimization with perceptual metrics. We introduce DMOSpeech, a distilled diffusion-based TTS model that uniquely achieves both faster inference and superior performance compared to its teacher model. By enabling direct gradient pathways to all model components, we demonstrate the first successful end-to-end optimization of differentiable metrics in TTS, incorporating Connectionist Temporal Classification (CTC) loss and Speaker Verification (SV) loss. Our comprehensive experiments, validated through extensive human evaluation, show significant improvements in naturalness, int
Related papers
- High-fidelity Speech Synthesis With Minimal Supervision: All Using Diffusion Models (2023) 5.24
- FastDiff: A Fast Conditional Diffusion Model For High-quality Speech Synthesis (2022) 14.35
- Minimally-supervised Speech Synthesis With Conditional Diffusion Model And Language Model: A Comparative Study Of Semantic Coding (2023) 8.82
- DLPO: Diffusion Model Loss-guided Reinforcement Learning For Fine-tuning Text-to-speech Diffusion Models (2024) 0.00
- BDDM: Bilateral Denoising Diffusion Models For Fast And High-quality Speech Synthesis (2022) 4.76
- ProDiff: Progressive Fast Diffusion Model For High-quality Text-to-speech (2022) 0.00
- NaturalSpeech 3: Zero-shot Speech Synthesis With Factorized Codec And Diffusion Models (2024) 0.00
- Diffusion Synthesizer For Efficient Multilingual Speech To Speech Translation (2024) 0.00