Diffar: Denoising Diffusion Autoregressive Model For Raw Speech Waveform Generation
2023 · Roi Benita, Michael Elad, Joseph Keshet
Abstract
Diffusion models have recently been shown to be relevant for high-quality speech generation. Most work has been focused on generating spectrograms, and as such, they further require a subsequent model to convert the spectrogram to a waveform (i.e., a vocoder). This work proposes a diffusion probabilistic end-to-end model for generating a raw speech waveform. The proposed model is autoregressive, generating overlapping frames sequentially, where each frame is conditioned on a portion of the previously generated one. Hence, our model can effectively synthesize an unlimited speech duration while preserving high-fidelity synthesis and temporal coherence. We implemented the proposed model for unconditional and conditional speech generation, where the latter can be driven by an input sequence of phonemes, amplitudes, and pitch values. Working on the waveform directly has some empirical advantages. Specifically, it allows the creation of local acoustic behaviors, like vocal fry, which makes t
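The overlapping-frame scheme described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `toy_frame_sampler` is a hypothetical stand-in for a trained diffusion sampler (it only shrinks Gaussian noise), and pinning the overlapped samples to the previous frame's tail is a simplification, not necessarily how the paper conditions its model. The names `generate`, `frame_len`, and `overlap` are likewise hypothetical.

```python
import numpy as np


def toy_frame_sampler(context, frame_len, rng, steps=10):
    """Hypothetical stand-in for a diffusion sampler (not the paper's model).

    Starts from Gaussian noise and applies a placeholder "denoising" update.
    If `context` is given, the first len(context) samples are pinned to it,
    a crude way to condition on the tail of the previous frame.
    """
    overlap = 0 if context is None else len(context)
    x = rng.standard_normal(frame_len)
    for _ in range(steps):
        x = 0.8 * x  # placeholder for a learned reverse-diffusion step
        if overlap:
            x[:overlap] = context  # keep the overlapped region consistent
    return x


def generate(num_frames, frame_len, overlap, sampler=toy_frame_sampler, seed=0):
    """Generate frames one at a time and stitch them into a single waveform.

    Assumes num_frames >= 1. Each new frame is conditioned on the last
    `overlap` samples of the previous frame; only its non-overlapping
    part is appended to the output.
    """
    if not 0 <= overlap < frame_len:
        raise ValueError("overlap must satisfy 0 <= overlap < frame_len")
    rng = np.random.default_rng(seed)
    frame = sampler(None, frame_len, rng)  # first frame has no context
    pieces = [frame]
    for _ in range(num_frames - 1):
        context = frame[-overlap:] if overlap else None
        frame = sampler(context, frame_len, rng)
        pieces.append(frame[overlap:])  # drop the part already emitted
    return np.concatenate(pieces)


wave = generate(num_frames=4, frame_len=1000, overlap=250)
```

Because each step needs only the tail of the previous frame, the loop can run for any number of frames; the output length is `frame_len + (num_frames - 1) * (frame_len - overlap)`.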
Related papers
- Diffwave: A Versatile Diffusion Model For Audio Synthesis (2020) 0.00
- Fastdiff: A Fast Conditional Diffusion Model For High-quality Speech Synthesis (2022) 14.35
- Diffmotion: Speech-driven Gesture Synthesis Using Denoising Diffusion Model (2023) 9.59
- Undiff: Unsupervised Voice Restoration With Unconditional Diffusion Model (2023) 5.24
- Speech Enhancement And Dereverberation With Diffusion-based Generative Models (2022) 23.51
- Speaking In Wavelet Domain: A Simple And Efficient Approach To Speed Up Speech Diffusion Model (2024) 5.24
- Storm: A Diffusion-based Stochastic Regeneration Model For Speech Enhancement And Dereverberation (2022) 15.43
- Diffprosody: Diffusion-based Latent Prosody Generation For Expressive Speech Synthesis With Prosody Conditional Adversarial Training (2023) 10.07