Adversarial Training Of Denoising Diffusion Model Using Dual Discriminators For High-fidelity Multi-speaker TTS
2023 · Myeongjin Ko, Yong-Hoon Choi
Abstract
Diffusion models can generate high-quality data through a probabilistic approach. However, they generate slowly because sampling requires a large number of time steps. To address this limitation, recent models such as denoising diffusion implicit models (DDIM) focus on generating samples without directly modeling the probability distribution, while models such as denoising diffusion generative adversarial networks (denoising diffusion GANs) combine the diffusion process with GANs. In speech synthesis, DiffGAN-TTS, a recent diffusion-based model that adopts the GAN structure, demonstrates superior performance in both speech quality and generation speed. In this paper, to further enhance the performance of DiffGAN-TTS, we propose a speech synthesis model with two discriminators: a diffusion discriminator for learning the distribution of the reverse process and a spectrogram discriminator for learning the distr
Related papers
- DiffGAN-TTS: High-fidelity And Efficient Text-to-speech With Denoising Diffusion GANs (2022) · 0.00
- BDDM: Bilateral Denoising Diffusion Models For Fast And High-quality Speech Synthesis (2022) · 4.76
- Single And Few-step Diffusion For Generative Speech Enhancement (2023) · 10.21
- FastDiff: A Fast Conditional Diffusion Model For High-quality Speech Synthesis (2022) · 14.35
- ProDiff: Progressive Fast Diffusion Model For High-quality Text-to-speech (2022) · 0.00
- DDTSE: Discriminative Diffusion Model For Target Speech Extraction (2023) · 5.84
- ResGrad: Residual Denoising Diffusion Probabilistic Models For Text To Speech (2022) · 0.00
- Multi-GradSpeech: Towards Diffusion-based Multi-speaker Text-to-speech Using Consistent Diffusion Models (2023) · 0.00