Hierarchical Prosody Modeling And Control In Non-autoregressive Parallel Neural TTS
2021 · Tuomo Raitio, Jiangchuan Li, Shreyas Seshadri
Abstract
Neural text-to-speech (TTS) synthesis can generate speech that is indistinguishable from natural speech. However, the synthetic speech often represents the average prosodic style of the database instead of having more versatile prosodic variation. Moreover, many models lack the ability to control the output prosody, which prevents rendering the same text input in different styles. In this work, we train a non-autoregressive parallel neural TTS front-end model hierarchically conditioned on both coarse and fine-grained acoustic speech features to learn a latent prosody space with intuitive and meaningful dimensions. Experiments show that a non-autoregressive TTS model hierarchically conditioned on utterance-wise pitch, pitch range, duration, energy, and spectral tilt can effectively control each prosodic dimension, generate a wide variety of speaking styles, and provide word-wise emphasis control, while maintaining quality equal to or better than that of the baseline model.
Related papers
- Hierarchical Prosody Modeling For Non-autoregressive Speech Synthesis (2020) 10.07
- Emphasis Control For Parallel Neural TTS (2021) 6.77
- Controllable Neural Text-to-speech Synthesis Using Intuitive Prosodic Features (2020) 11.76
- HiGNN-TTS: Hierarchical Prosody Modeling With Graph Neural Networks For Expressive Long-form TTS (2023) 5.84
- Prosody-controllable Spontaneous TTS With Neural HMMs (2022) 8.09
- Fully-hierarchical Fine-grained Prosody Modeling For Interpretable Speech Synthesis (2020) 13.28
- Hierarchical Generative Modeling For Controllable Speech Synthesis (2018) 0.00
- DiffStyleTTS: Diffusion-based Hierarchical Prosody Modeling For Text-to-speech With Diverse And Controllable Styles (2024) 0.00