Controllable Speech Synthesis By Learning Discrete Phoneme-level Prosodic Representations
2022 · Nikolaos Ellinas, Myrsini Christidou, Alexandra Vioni, et al.
Abstract
In this paper, we present a novel method for phoneme-level prosody control of F0 and duration using intuitive discrete labels. We propose an unsupervised prosodic clustering process which is used to discretize phoneme-level F0 and duration features from a multispeaker speech dataset. These features are fed as an input sequence of prosodic labels to a prosody encoder module which augments an autoregressive attention-based text-to-speech model. We utilize various methods to improve prosodic control range and coverage, such as augmentation, F0 normalization, balanced clustering for duration and speaker-independent clustering. The final model enables fine-grained phoneme-level prosody control for all speakers contained in the training set, while maintaining the speaker identity. Instead of relying on reference utterances for inference, we introduce a prior prosody encoder which learns the style of each speaker and enables speech synthesis without the requirement of reference audio.
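The discretization step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract names F0 normalization, clustering and balanced clustering for duration, but not the specific algorithms, so the per-speaker z-scoring, the 1-D k-means and the equal-frequency binning below are all assumptions, as are the function names and the toy values.

```python
from statistics import mean, pstdev


def normalize_f0(f0_by_speaker):
    """Z-score each speaker's phoneme-level F0 (assumed normalization scheme)."""
    out = {}
    for spk, values in f0_by_speaker.items():
        mu, sigma = mean(values), pstdev(values)
        out[spk] = [(v - mu) / sigma if sigma > 0 else 0.0 for v in values]
    return out


def kmeans_1d(values, k, iters=50):
    """Plain 1-D k-means (assumed clustering method); returns sorted centroids."""
    srt = sorted(values)
    # Initialize centroids at evenly spaced quantiles of the data.
    centroids = [srt[int((i + 0.5) * len(srt) / k)] for i in range(k)]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: abs(v - centroids[j]))
            buckets[nearest].append(v)
        # Keep the old centroid if a cluster ends up empty.
        new = [mean(b) if b else centroids[j] for j, b in enumerate(buckets)]
        if new == centroids:
            break
        centroids = new
    return sorted(centroids)


def assign_label(value, centroids):
    """Discrete label = index of the nearest centroid (0 is the lowest cluster)."""
    return min(range(len(centroids)), key=lambda j: abs(value - centroids[j]))


def balanced_edges(values, k):
    """Equal-frequency bin edges, so each duration label covers a similar count."""
    srt = sorted(values)
    return [srt[int(i * len(srt) / k)] for i in range(1, k)]


def bin_label(value, edges):
    """Discrete label = number of edges the value reaches or exceeds."""
    return sum(value >= e for e in edges)


# Toy data (hypothetical): mean F0 per phoneme in Hz, duration in frames.
f0 = {"spk_a": [100, 110, 120, 130], "spk_b": [200, 220, 240, 260]}
normalized = normalize_f0(f0)

f0_centroids = kmeans_1d([0.8, 1.0, 1.2, 4.8, 5.0, 5.2], k=2)
durations = [3, 4, 5, 6, 7, 8, 9, 10, 40]
edges = balanced_edges(durations, k=3)
duration_labels = [bin_label(d, edges) for d in durations]
```

Pooling the normalized F0 values of all speakers before clustering is one way to obtain speaker-independent labels. Equal-frequency bins keep a long-tailed duration distribution from collapsing most phonemes into a single label.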
Related papers
- Prosodic Clustering For Phoneme-level Prosody Control In End-to-end Speech Synthesis (2021) 5.84
- Improved Prosodic Clustering For Multispeaker And Speaker-independent Phoneme-level Prosody Control (2021) 0.00
- Perception Of Prosodic Variation For Speech Synthesis Using An Unsupervised Discrete Representation Of F0 (2020) 7.81
- Unsupervised Word-level Prosody Tagging For Controllable Speech Synthesis (2022) 7.16
- Controllable Prosody Generation With Partial Inputs (2023) 3.58
- Supervised And Unsupervised Approaches For Controlling Narrow Lexical Focus In Sequence-to-sequence Speech Synthesis (2021) 7.50
- ProsodyFM: Unsupervised Phrasing And Intonation Control For Intelligible Speech Synthesis (2024) 0.00
- Speech Resynthesis From Discrete Disentangled Self-supervised Representations (2021) 16.25