Stochastic Pitch Prediction Improves the Diversity and Naturalness of Speech in Glow-TTS
2023 · Sewade Ogun, Vincent Colotte, Emmanuel Vincent
Abstract
Flow-based generative models are widely used in text-to-speech (TTS) systems to learn the distribution of audio features (e.g., Mel-spectrograms) given the input tokens and to sample from this distribution to generate diverse utterances. However, in the zero-shot multi-speaker TTS scenario, the generated utterances lack diversity and naturalness. In this paper, we propose to improve the diversity of utterances by explicitly learning the distribution of fundamental frequency sequences (pitch contours) of each speaker during training using a stochastic flow-based pitch predictor, then conditioning the model on generated pitch contours during inference. The experimental results demonstrate that the proposed method yields a significant improvement in the naturalness and diversity of speech generated by a Glow-TTS model that uses explicit stochastic pitch prediction, over a Glow-TTS baseline and an improved Glow-TTS model that uses a stochastic duration predictor.
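The train-then-sample idea in the abstract can be illustrated with a toy sketch. The abstract does not specify the architecture of the pitch predictor, so the code below is an assumption made only for illustration: it uses a single conditional affine transform (the simplest normalizing flow) over log-F0 frames. The names `affine_flow_forward`, `affine_flow_sample`, `mu`, and `log_sigma` are hypothetical; in a real system the mean and scale would presumably be produced by a network conditioned on the text encoding and a speaker embedding, and the flow would likely stack several invertible layers.

```python
import numpy as np

def affine_flow_forward(f0, mu, log_sigma):
    """Map log-F0 frames to the latent z and return the sequence log-likelihood.

    Training a flow maximizes this exact log-likelihood.
    """
    z = (f0 - mu) * np.exp(-log_sigma)
    # Change of variables: log p(f0) = log N(z; 0, I) + log|dz/df0|,
    # where dz/df0 = exp(-log_sigma) for an affine transform.
    log_prior = -0.5 * (z ** 2 + np.log(2.0 * np.pi))
    log_det = -log_sigma
    return z, np.sum(log_prior + log_det, axis=-1)

def affine_flow_sample(mu, log_sigma, rng, temperature=1.0):
    """Invert the flow: draw z from a scaled Gaussian and map it to log-F0."""
    z = temperature * rng.standard_normal(mu.shape)
    return mu + np.exp(log_sigma) * z

rng = np.random.default_rng(0)
n_frames = 50
# Hypothetical conditioning outputs (assumed values, not taken from the paper):
# a flat mean contour around 120 Hz with a small per-frame spread in log-F0.
mu = np.full(n_frames, np.log(120.0))
log_sigma = np.full(n_frames, np.log(0.1))

# Two draws for the same text and speaker give two different pitch contours.
contour_a = affine_flow_sample(mu, log_sigma, rng)
contour_b = affine_flow_sample(mu, log_sigma, rng)

# The forward pass recovers the latent and scores the contour.
z, log_lik = affine_flow_forward(contour_a, mu, log_sigma)

# A temperature of 0 collapses sampling to the deterministic mean contour.
deterministic = affine_flow_sample(mu, log_sigma, rng, temperature=0.0)
```

In this sketch the sampling temperature controls the trade-off the abstract refers to: a temperature of 0 reproduces a single deterministic contour, while larger values produce more varied contours for the same input.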
Related papers
- Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search (2020)
- Generative Modeling for Low Dimensional Speech Attributes with Neural Spline Flows (2022)
- GlowVC: Mel-spectrogram Space Disentangling Model for Language-independent Text-free Voice Conversion (2022)
- Generative Pre-training for Speech with Flow Matching (2023)
- DiFlow-TTS: Compact and Low-latency Zero-shot Text-to-Speech with Factorized Discrete Flow Matching (2025)
- ReFlow-TTS: A Rectified Flow Model for High-fidelity Text-to-Speech (2023)
- VoiceFlow: Efficient Text-to-Speech with Rectified Flow Matching (2023)
- Glow-WaveGAN: Learning Speech Representations from GAN-based Variational Auto-encoder for High Fidelity Flow-based Speech Synthesis (2021)