Unsupervised Word-level Prosody Tagging For Controllable Speech Synthesis
2022 · Yiwei Guo, Chenpeng Du, Kai Yu
Abstract
Although word-level prosody modeling in neural text-to-speech (TTS) has been investigated in recent research for diverse speech synthesis, it is still challenging to control speech synthesis manually without a specific reference. This is largely due to the lack of word-level prosody tags. In this work, we propose a novel two-stage approach for unsupervised word-level prosody tagging: we first group the words into different types with a decision tree according to their phonetic content, and then cluster the prosodies with a GMM within each type of word separately. This design is based on the assumption that the prosodies of different types of words, such as long or short words, should be tagged with different label sets. Furthermore, a TTS system with the derived word-level prosody tags is trained for controllable speech synthesis. Experiments on LJSpeech show that the TTS model trained with word-level prosody tags not only achieves better naturalness than a typical FastSpeech2 mo
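The two-stage tagging idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation, and several simplifications are assumptions made here for brevity: a fixed phoneme-count threshold (`word_type`) stands in for the learned decision tree, each word's prosody is a single hypothetical scalar rather than a learned embedding, and the mixture model is a small hand-written 1-D EM routine (`fit_gmm_1d`). All function names are illustrative.

```python
import math
from collections import defaultdict


def gauss(v, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((v - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_gmm_1d(values, k, iters=50):
    """Fit a k-component 1-D Gaussian mixture with EM.

    Returns (weights, means, variances). Means are initialised at evenly
    spaced quantiles so the result is deterministic.
    """
    srt = sorted(values)
    n = len(srt)
    means = [srt[int((i + 0.5) * n / k)] for i in range(k)]
    mean_all = sum(srt) / n
    var_all = sum((v - mean_all) ** 2 for v in srt) / n + 1e-6
    variances = [var_all] * k
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: responsibility of each component for each value.
        resp = []
        for v in values:
            dens = [w * gauss(v, m, s) for w, m, s in zip(weights, means, variances)]
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means, and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp) + 1e-12
            weights[j] = nj / n
            means[j] = sum(r[j] * v for r, v in zip(resp, values)) / nj
            variances[j] = (
                sum(r[j] * (v - means[j]) ** 2 for r, v in zip(resp, values)) / nj + 1e-6
            )
    return weights, means, variances


def word_type(phonemes):
    """Stage 1 (assumed stand-in for the decision tree): split words by length."""
    return "short" if len(phonemes) <= 3 else "long"


def tag_words(words, k=2):
    """Stage 2: fit one mixture per word type and tag each word.

    `words` is a list of (phonemes, prosody_value) pairs. Each returned tag
    is (word_type, component_index), so every type has its own label set.
    """
    by_type = defaultdict(list)
    for phonemes, value in words:
        by_type[word_type(phonemes)].append(value)
    models = {t: fit_gmm_1d(vals, k) for t, vals in by_type.items()}

    tags = []
    for phonemes, value in words:
        t = word_type(phonemes)
        weights, means, variances = models[t]
        scores = [w * gauss(value, m, s) for w, m, s in zip(weights, means, variances)]
        tags.append((t, scores.index(max(scores))))
    return tags


if __name__ == "__main__":
    demo = [(["DH", "AH"], 0.1), (["DH", "AH"], 5.0),
            (["B", "AH", "N", "AE", "N", "AH"], 10.0),
            (["B", "AH", "N", "AE", "N", "AH"], 20.0)]
    print(tag_words(demo, k=2))
```

Because each tag pairs the word type with a component index, component 0 of the short-word mixture and component 0 of the long-word mixture are distinct labels, which mirrors the abstract's assumption that different word types need different label sets.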
Related papers
- Controllable Speech Synthesis By Learning Discrete Phoneme-level Prosodic Representations (2022)
- Hierarchical Prosody Modeling For Non-autoregressive Speech Synthesis (2020)
- Prosody-controllable Spontaneous TTS With Neural Hmms (2022)
- Hierarchical Prosody Modeling And Control In Non-autoregressive Parallel Neural TTS (2021)
- Controllable Neural Text-to-speech Synthesis Using Intuitive Prosodic Features (2020)
- Prosodic Clustering For Phoneme-level Prosody Control In End-to-end Speech Synthesis (2021)
- Semi-supervised Generative Modeling For Controllable Speech Synthesis (2019)
- Multi-modal Automatic Prosody Annotation With Contrastive Pretraining Of SSWP (2023)