Improved Prosodic Clustering For Multispeaker And Speaker-independent Phoneme-level Prosody Control
2021 · Myrsini Christidou, Alexandra Vioni, Nikolaos Ellinas, et al.
Abstract
This paper presents a method for phoneme-level prosody control of F0 and duration in a multispeaker text-to-speech setup, based on prosodic clustering. An autoregressive attention-based model is used, incorporating multispeaker architecture modules in parallel to a prosody encoder. Several improvements over the basic single-speaker method are proposed that increase the prosodic control range and coverage. More specifically, we employ data augmentation, F0 normalization, balanced clustering for duration, and speaker-independent prosodic clustering. These modifications enable fine-grained phoneme-level prosody control for all speakers contained in the training set, while maintaining the speaker identity. The model is also fine-tuned to unseen speakers with limited amounts of data, and it is shown to maintain its prosody control capabilities, verifying that the speaker-independent prosodic clustering is effective. Experimental results verify that the model maintains high output spe
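The abstract names F0 normalization, speaker-independent clustering, and balanced clustering for duration without giving details. The sketch below is one plausible reading of those steps, not the authors' implementation. It assumes z-score normalization per speaker, plain 1-D k-means over the pooled normalized F0 values, and equal-frequency (quantile) binning for durations. All function names are hypothetical.

```python
import numpy as np

def normalize_f0_per_speaker(f0, speaker_ids):
    """Z-score each phoneme-level F0 value with its own speaker's statistics.
    Assumption: the paper's exact normalization scheme may differ."""
    f0 = np.asarray(f0, dtype=float)
    speaker_ids = np.asarray(speaker_ids)
    out = np.empty_like(f0)
    for spk in np.unique(speaker_ids):
        mask = speaker_ids == spk
        std = f0[mask].std()
        out[mask] = (f0[mask] - f0[mask].mean()) / (std if std > 0 else 1.0)
    return out

def kmeans_1d(values, k, n_iter=50):
    """Lloyd's k-means on scalars, initialised at evenly spaced quantiles.
    Running it on values pooled over all speakers gives shared clusters."""
    values = np.asarray(values, dtype=float)
    centroids = np.quantile(values, (np.arange(k) + 0.5) / k)
    for _ in range(n_iter):
        labels = np.abs(values[:, None] - centroids[None, :]).argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = values[labels == j].mean()
    centroids = np.sort(centroids)  # label 0 = lowest F0 cluster
    labels = np.abs(values[:, None] - centroids[None, :]).argmin(axis=1)
    return labels, centroids

def balanced_bins(values, k):
    """Equal-frequency binning: each of the k bins receives roughly the same
    number of samples, so rare long durations still get usable clusters."""
    values = np.asarray(values, dtype=float)
    edges = np.quantile(values, np.arange(1, k) / k)
    labels = np.searchsorted(edges, values, side="right")
    centroids = np.array([
        values[labels == j].mean() if np.any(labels == j) else np.nan
        for j in range(k)
    ])
    return labels, centroids

# Toy data: two speakers with different pitch ranges but the same contour.
f0 = [100.0, 120.0, 140.0, 200.0, 220.0, 240.0]
speakers = ["a", "a", "a", "b", "b", "b"]
f0_norm = normalize_f0_per_speaker(f0, speakers)
f0_labels, f0_centroids = kmeans_1d(f0_norm, k=3)

# Skewed toy durations (arbitrary units) for the balanced case.
durations = [10, 11, 12, 13, 14, 15, 16, 17, 50, 80, 120, 200]
dur_labels, dur_centroids = balanced_bins(durations, k=4)
```

In this reading, the cluster labels act as discrete per-phoneme prosody tokens. Because both toy speakers map to the same normalized values, they share the same F0 labels, which is the property that makes the clustering speaker-independent.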
Related papers
- Prosodic Clustering For Phoneme-level Prosody Control In End-to-end Speech Synthesis (2021)
- Controllable Speech Synthesis By Learning Discrete Phoneme-level Prosodic Representations (2022)
- Perception Of Prosodic Variation For Speech Synthesis Using An Unsupervised Discrete Representation Of F0 (2020)
- Hierarchical Prosody Modeling And Control In Non-autoregressive Parallel Neural TTS (2021)
- Unsupervised Word-level Prosody Tagging For Controllable Speech Synthesis (2022)
- Fine-grained Robust Prosody Transfer For Single-speaker Neural Text-to-speech (2019)
- Prosody-controllable Spontaneous TTS With Neural HMMs (2022)
- Hierarchical Prosody Modeling For Non-autoregressive Speech Synthesis (2020)