ResGrad: Residual Denoising Diffusion Probabilistic Models for Text to Speech
2022 · Zehua Chen, Yihan Wu, Yichong Leng, et al.
Abstract
Denoising Diffusion Probabilistic Models (DDPMs) are emerging in text-to-speech (TTS) synthesis because of their strong capability of generating high-fidelity samples. However, their iterative refinement process in high-dimensional data space results in slow inference speed, which restricts their application in real-time systems. Previous works have explored speeding up inference by minimizing the number of inference steps, but at the cost of sample quality. In this work, to improve the inference speed of DDPM-based TTS models while achieving high sample quality, we propose ResGrad, a lightweight diffusion model which learns to refine the output spectrogram of an existing TTS model (e.g., FastSpeech 2) by predicting the residual between the model output and the corresponding ground-truth speech. ResGrad has several advantages: 1) Compared with other acceleration methods for DDPMs, which need to synthesize speech from scratch, ResGrad reduces the complexity of the task by changing the generation target f
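The residual idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the array shapes, the synthetic data, and the use of the standard DDPM forward-noising formula on the residual are all assumptions made for illustration. No network is trained here; the sketch only shows what the diffusion model would be asked to generate and how its output would be added back to the base model's spectrogram.

```python
import numpy as np


def residual_target(gt_mel, pred_mel):
    """Residual between ground-truth and base-model mel-spectrograms."""
    return gt_mel - pred_mel


def noisy_residual(residual, alpha_bar_t, rng):
    """Standard DDPM forward noising applied to the residual.

    Assumption: the abstract does not give the exact formulation;
    this is the usual x_t = sqrt(a) * x_0 + sqrt(1 - a) * eps.
    """
    eps = rng.standard_normal(residual.shape)
    x_t = np.sqrt(alpha_bar_t) * residual + np.sqrt(1.0 - alpha_bar_t) * eps
    return x_t, eps


def refine(pred_mel, sampled_residual):
    """Add a (sampled) residual back onto the base-model output."""
    return pred_mel + sampled_residual


rng = np.random.default_rng(0)

# Synthetic stand-ins: 80 mel bins x 100 frames (hypothetical sizes).
gt_mel = rng.standard_normal((80, 100))
# Stand-in for the output of an existing TTS model such as FastSpeech 2.
pred_mel = gt_mel + 0.1 * rng.standard_normal((80, 100))

res = residual_target(gt_mel, pred_mel)
x_t, eps = noisy_residual(res, alpha_bar_t=0.5, rng=rng)

# With a perfectly predicted residual, refinement recovers the ground truth.
refined = refine(pred_mel, res)
```

In this toy setup the residual has a much smaller magnitude than the spectrogram itself, which gives some intuition for why generating it may be an easier target than synthesizing speech from scratch.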
Related papers
- ProDiff: Progressive Fast Diffusion Model for High-Quality Text-to-Speech (2022) · 0.00
- DiffGAN-TTS: High-Fidelity and Efficient Text-to-Speech with Denoising Diffusion GANs (2022) · 0.00
- FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis (2022) · 14.35
- SpecGrad: Diffusion Probabilistic Model Based Neural Vocoder with Adaptive Noise Spectral Shaping (2022) · 11.49
- BDDM: Bilateral Denoising Diffusion Models for Fast and High-Quality Speech Synthesis (2022) · 4.76
- Speaking in Wavelet Domain: A Simple and Efficient Approach to Speed Up Speech Diffusion Model (2024) · 5.24
- Adversarial Training of Denoising Diffusion Model Using Dual Discriminators for High-Fidelity Multi-Speaker TTS (2023) · 2.26
- ReFlow-TTS: A Rectified Flow Model for High-Fidelity Text-to-Speech (2023) · 7.50