Comparing Normalizing Flows And Diffusion Models For Prosody And Acoustic Modelling In Text-to-speech
2023 · Guangyan Zhang, Thomas Merritt, Manuel Sam Ribeiro, et al.
Abstract
Neural text-to-speech systems are often optimized on L1/L2 losses, which make strong assumptions about the distributions of the target data space. Aiming to improve those assumptions, Normalizing Flows and Diffusion Probabilistic Models were recently proposed as alternatives. In this paper, we compare traditional L1/L2-based approaches to diffusion and flow-based approaches for the tasks of prosody and mel-spectrogram prediction for text-to-speech synthesis. We use a prosody model to generate log-f0 and duration features, which are used to condition an acoustic model that generates mel-spectrograms. Experimental results demonstrate that the flow-based model achieves the best performance for spectrogram prediction, improving over equivalent diffusion and L1 models. Meanwhile, both diffusion and flow-based prosody predictors result in significant improvements over a typical L2-trained prosody model.
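The "strong assumptions" behind L1/L2 losses can be made concrete with a minimal sketch, which is not taken from the paper. It relies on a standard fact: minimizing an L2 loss is equivalent to maximizing a fixed-variance Gaussian likelihood, and minimizing an L1 loss to maximizing a fixed-scale Laplace likelihood. The target value, the two candidate predictions, and all function names below are hypothetical, chosen only for illustration.

```python
import math

def l2_loss(y, mu):
    """Squared error between target y and prediction mu."""
    return (y - mu) ** 2

def l1_loss(y, mu):
    """Absolute error between target y and prediction mu."""
    return abs(y - mu)

def gaussian_nll(y, mu, sigma=1.0):
    """Negative log-likelihood of y under N(mu, sigma^2)."""
    return 0.5 * math.log(2 * math.pi * sigma ** 2) + (y - mu) ** 2 / (2 * sigma ** 2)

def laplace_nll(y, mu, b=1.0):
    """Negative log-likelihood of y under Laplace(mu, b)."""
    return math.log(2 * b) + abs(y - mu) / b

# Hypothetical log-f0 target and two candidate predictions (illustrative values).
y, mu_a, mu_b = 5.0, 4.6, 5.3

# With unit variance/scale, the likelihood terms that depend on the prediction
# reduce to 0.5 * L2 (Gaussian) and L1 (Laplace); the constants cancel when
# two predictions are compared.
gauss_gap = gaussian_nll(y, mu_a) - gaussian_nll(y, mu_b)
l2_gap = 0.5 * (l2_loss(y, mu_a) - l2_loss(y, mu_b))
laplace_gap = laplace_nll(y, mu_a) - laplace_nll(y, mu_b)
l1_gap = l1_loss(y, mu_a) - l1_loss(y, mu_b)

print(math.isclose(gauss_gap, l2_gap, abs_tol=1e-12))
print(math.isclose(laplace_gap, l1_gap, abs_tol=1e-12))
```

Both losses therefore commit to a unimodal error distribution with a fixed spread around a single predicted value. Flow and diffusion models avoid that constraint by learning the target distribution itself.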