ProDiff: Progressive Fast Diffusion Model for High-Quality Text-to-Speech
2022 · Rongjie Huang, Zhou Zhao, Huadai Liu, et al.
Abstract
Denoising diffusion probabilistic models (DDPMs) have recently achieved leading performance in many generative tasks. However, the cost of their inherent iterative sampling process hinders their application to text-to-speech deployment. Through a preliminary study on diffusion model parameterization, we find that previous gradient-based TTS models require hundreds or thousands of iterations to guarantee high sample quality, which poses a challenge for accelerating sampling. In this work, we propose ProDiff, a progressive fast diffusion model for high-quality text-to-speech. Unlike previous work that estimates the gradient of the data density, ProDiff parameterizes the denoising model by directly predicting clean data, which avoids distinct quality degradation when sampling is accelerated. To tackle the model convergence challenge that arises with fewer diffusion iterations, ProDiff reduces the data variance on the target side via knowledge distillation. Specifically, the denoising model uses the generated mel-sp
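The contrast the abstract draws between gradient-based (noise-prediction) models and direct clean-data prediction can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the toy mel-spectrogram shape, and the chosen noise level are all assumptions made for illustration. It uses only the standard DDPM forward process, x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε, and shows that a noise prediction implies a clean-data estimate whose error is scaled by sqrt(1 − ᾱ_t)/sqrt(ᾱ_t), a factor that becomes large at high noise levels.

```python
import numpy as np


def forward_diffuse(x0, alpha_bar_t, rng):
    """Sample x_t from the standard DDPM forward process q(x_t | x_0)."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps
    return x_t, eps


def x0_from_eps(x_t, eps_hat, alpha_bar_t):
    """Clean-data estimate implied by a noise prediction eps_hat."""
    return (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_hat) / np.sqrt(alpha_bar_t)


def noise_prediction_loss(eps_hat, eps):
    """Training target when the network predicts the added noise."""
    return float(np.mean((eps_hat - eps) ** 2))


def clean_data_loss(x0_hat, x0):
    """Training target when the network predicts the clean data directly."""
    return float(np.mean((x0_hat - x0) ** 2))


# Toy example: a hypothetical 4-frame, 80-bin "mel-spectrogram".
rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 80))
alpha_bar_t = 0.01  # assumed high-noise step

x_t, eps = forward_diffuse(x0, alpha_bar_t, rng)

# A small, constant error in the noise prediction ...
eps_hat = eps + 0.01
# ... is amplified by sqrt(1 - alpha_bar_t) / sqrt(alpha_bar_t) in x0 space.
x0_error = x0_from_eps(x_t, eps_hat, alpha_bar_t) - x0
amplification = np.sqrt(1.0 - alpha_bar_t) / np.sqrt(alpha_bar_t)
```

The amplification factor is one possible intuition for why taking few, large sampling steps is harder with a noise-prediction model; the abstract itself only states that direct clean-data prediction avoids distinct quality degradation under accelerated sampling.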
Related papers
- FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis (2022)
- ResGrad: Residual Denoising Diffusion Probabilistic Models for Text to Speech (2022)
- DiffGAN-TTS: High-Fidelity and Efficient Text-to-Speech with Denoising Diffusion GANs (2022)
- DiffProsody: Diffusion-Based Latent Prosody Generation for Expressive Speech Synthesis with Prosody Conditional Adversarial Training (2023)
- BDDM: Bilateral Denoising Diffusion Models for Fast and High-Quality Speech Synthesis (2022)
- Adversarial Training of Denoising Diffusion Model Using Dual Discriminators for High-Fidelity Multi-Speaker TTS (2023)
- Speaking in Wavelet Domain: A Simple and Efficient Approach to Speed Up Speech Diffusion Model (2024)
- ReFlow-TTS: A Rectified Flow Model for High-Fidelity Text-to-Speech (2023)