Uncovering Latent Style Factors For Expressive Speech Synthesis
2017 · Yuxuan Wang, RJ Skerry-Ryan, Ying Xiao, et al.
Abstract
Prosodic modeling is a core problem in speech synthesis. The key challenge is producing desirable prosody from textual input containing only phonetic information. In this preliminary study, we introduce the concept of "style tokens" in Tacotron, a recently proposed end-to-end neural speech synthesis model. Using style tokens, we aim to extract independent prosodic styles from training data. We show that without annotation data or an explicit supervision signal, our approach can automatically learn a variety of prosodic variations in a purely data-driven way. Importantly, each style token corresponds to a fixed style factor regardless of the given text sequence. As a result, we can control the prosodic style of synthetic speech in a somewhat predictable and globally consistent way.
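The property claimed in the last two sentences, that a style token contributes the same style factor whatever the input text, can be illustrated with a toy sketch. The abstract does not say how the tokens are wired into Tacotron, so everything below is an assumption made for illustration only: the names `make_token_bank`, `condition_on_token`, and `style_offset` are hypothetical, the token vectors are random rather than learned, and adding the token vector to each text state is a stand-in for whatever conditioning the model actually uses.

```python
import random


def make_token_bank(num_tokens, dim, seed=0):
    """Build a bank of style token vectors.

    Assumption: in a real model these would be learned jointly with the
    synthesizer; here they are random placeholders.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_tokens)]


def condition_on_token(text_states, token_bank, token_index):
    """Add the chosen token vector to every text state.

    Assumption: simple addition is used only to make the idea concrete;
    it is not taken from the paper.
    """
    style = token_bank[token_index]
    return [[x + s for x, s in zip(state, style)] for state in text_states]


def style_offset(text_states, conditioned):
    """Recover the style contribution at the first position."""
    return [c - x for c, x in zip(conditioned[0], text_states[0])]


# Two hypothetical "texts" of different lengths, encoded as 3-dim states.
bank = make_token_bank(num_tokens=4, dim=3)
short_text = [[0.1, 0.2, 0.3]]
long_text = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]

# Conditioning both texts on token 2 applies the same style vector to each.
short_styled = condition_on_token(short_text, bank, token_index=2)
long_styled = condition_on_token(long_text, bank, token_index=2)
```

In this sketch the offset recovered from `short_styled` and from `long_styled` is the same vector, `bank[2]`, up to floating-point rounding. That is the sense in which a token chosen at synthesis time acts as a text-independent, globally applied control.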
Related papers
- Style Tokens: Unsupervised Style Modeling, Control And Transfer In End-to-end Speech Synthesis (2018)
- Learning Utterance-level Representations Through Token-level Acoustic Latents Prediction For Expressive Speech Synthesis (2022)
- Expressive TTS Training With Frame And Style Reconstruction Loss (2020)
- Predicting Expressive Speaking Style From Text In End-to-end Speech Synthesis (2018)
- StyleTTS: A Style-based Generative Model For Natural And Diverse Text-to-speech Synthesis (2022)
- Self-supervised Context-aware Style Representation For Expressive Speech Synthesis (2022)
- Multi-speaker Multi-style Text-to-speech Synthesis With Single-speaker Single-style Training Data Scenarios (2021)
- Controllable Neural Text-to-speech Synthesis Using Intuitive Prosodic Features (2020)