DiffWave: A Versatile Diffusion Model For Audio Synthesis
2020 · Zhifeng Kong, Wei Ping, Jiaji Huang, et al.
Abstract
In this work, we propose DiffWave, a versatile diffusion probabilistic model for conditional and unconditional waveform generation. The model is non-autoregressive and converts a white noise signal into a structured waveform through a Markov chain with a constant number of steps at synthesis. It is efficiently trained by optimizing a variant of the variational bound on the data likelihood. DiffWave produces high-fidelity audio in different waveform generation tasks, including neural vocoding conditioned on mel spectrograms, class-conditional generation, and unconditional generation. We demonstrate that DiffWave matches a strong WaveNet vocoder in terms of speech quality (MOS: 4.44 versus 4.43), while synthesizing orders of magnitude faster. In particular, it significantly outperforms autoregressive and GAN-based waveform models in the challenging unconditional generation task in terms of audio quality and sample diversity, as measured by various automatic and human evaluations.
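The "Markov chain with a constant number of steps" can be illustrated with a minimal NumPy sketch. It assumes the standard denoising-diffusion formulation (a noise schedule `betas`, a closed-form noising step, and an ancestral sampling loop), which the abstract does not spell out. The function names, the linear schedule, its endpoint values, and the step count are all illustrative assumptions, and `eps_model` is a hypothetical stand-in for a trained noise-prediction network rather than the authors' architecture.

```python
import numpy as np

def make_schedule(num_steps=50, beta_start=1e-4, beta_end=0.05):
    """Linear noise schedule (values are illustrative assumptions)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)  # cumulative signal retention per step
    return betas, alphas, alpha_bars

def diffuse(x0, t, alpha_bars, rng):
    """Noise a clean waveform x0 directly to step t (closed form)."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return xt, eps

def sample(eps_model, length, betas, alphas, alpha_bars, rng):
    """Run the reverse chain: white noise -> waveform in len(betas) steps."""
    x = rng.standard_normal(length)  # start from white noise
    for t in range(len(betas) - 1, -1, -1):
        eps_hat = eps_model(x, t)  # predicted noise (hypothetical model)
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        if t > 0:
            # Posterior standard deviation for the intermediate steps.
            sigma = np.sqrt((1.0 - alpha_bars[t - 1]) / (1.0 - alpha_bars[t]) * betas[t])
            x = mean + sigma * rng.standard_normal(length)
        else:
            x = mean  # no noise is added at the final step
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    betas, alphas, alpha_bars = make_schedule()
    # Placeholder denoiser that predicts zero noise; a real model is trained.
    dummy_model = lambda x, t: np.zeros_like(x)
    audio = sample(dummy_model, 16000, betas, alphas, alpha_bars, rng)
    print(audio.shape)
```

With a trained network in place of `dummy_model`, the loop runs a fixed number of times regardless of waveform length, which is the sense in which synthesis is non-autoregressive: every sample of the waveform is updated in parallel at each step.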
Related papers
- DiffAR: Denoising Diffusion Autoregressive Model For Raw Speech Waveform Generation (2023) 0.00
- FastDiff: A Fast Conditional Diffusion Model For High-quality Speech Synthesis (2022) 14.35
- UnDiff: Unsupervised Voice Restoration With Unconditional Diffusion Model (2023) 5.24
- WaveFit: An Iterative And Non-autoregressive Neural Vocoder Based On Fixed-point Iteration (2022) 9.41
- EDMSound: Spectrogram Based Diffusion Models For Efficient And High-quality Audio Synthesis (2023) 0.00
- WaveNet: A Generative Model For Raw Audio (2016) 0.00
- NU-Wave: A Diffusion Probabilistic Model For Neural Audio Upsampling (2021) 12.40
- SpecDiff-GAN: A Spectrally-shaped Noise Diffusion GAN For Speech And Music Synthesis (2024) 7.81