Style Tokens: Unsupervised Style Modeling, Control And Transfer In End-to-end Speech Synthesis
2018 · Yuxuan Wang, Daisy Stanton, Yu Zhang, et al.
Abstract
In this work, we propose "global style tokens" (GSTs), a bank of embeddings that are jointly trained within Tacotron, a state-of-the-art end-to-end speech synthesis system. The embeddings are trained with no explicit labels, yet learn to model a large range of acoustic expressiveness. GSTs lead to a rich set of significant results. The soft interpretable "labels" they generate can be used to control synthesis in novel ways, such as varying speed and speaking style - independently of the text content. They can also be used for style transfer, replicating the speaking style of a single audio clip across an entire long-form text corpus. When trained on noisy, unlabeled found data, GSTs learn to factorize noise and speaker identity, providing a path towards highly scalable but robust speech synthesis.
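The idea of a token bank whose soft "labels" drive both control and transfer can be illustrated with a small sketch. This is not the authors' implementation: the sizes, the random initialization, the plain scaled dot-product attention, and the helper names (`token_weights`, `style_embedding`) are all assumptions chosen for illustration. In a real system the tokens and the reference query would be learned jointly with the synthesizer.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def token_weights(query, tokens):
    """Attend over the token bank with a query vector.

    Assumption: scaled dot-product attention stands in for whatever
    attention variant a trained model would use.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * t for q, t in zip(query, tok)) / scale for tok in tokens]
    return softmax(scores)


def style_embedding(weights, tokens):
    """Style embedding = weighted sum of the token embeddings."""
    dim = len(tokens[0])
    return [sum(w * tok[d] for w, tok in zip(weights, tokens)) for d in range(dim)]


random.seed(0)
NUM_TOKENS, DIM = 10, 8  # illustrative sizes, not taken from the source

# Hypothetical token bank; randomly initialized here, learned in practice.
tokens = [[math.tanh(random.gauss(0, 0.5)) for _ in range(DIM)]
          for _ in range(NUM_TOKENS)]

# Style transfer: a query derived from a reference clip (random stand-in here)
# yields soft weights over the tokens, which define the style embedding.
ref_query = [random.gauss(0, 1) for _ in range(DIM)]
weights = token_weights(ref_query, tokens)
transferred = style_embedding(weights, tokens)

# Control: skip the reference clip and set the weights by hand,
# here selecting a single token regardless of the text content.
manual = [0.0] * NUM_TOKENS
manual[3] = 1.0
controlled = style_embedding(manual, tokens)
```

In both cases the resulting vector would condition the synthesizer. The only difference is whether the weights come from a reference clip (transfer) or are chosen directly (control).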
Related papers
- Predicting Expressive Speaking Style From Text In End-to-end Speech Synthesis (2018)
- End-to-end Emotional Speech Synthesis Using Style Tokens And Semi-supervised Training (2019)
- Uncovering Latent Style Factors For Expressive Speech Synthesis (2017)
- StyleTTS: A Style-based Generative Model For Natural And Diverse Text-to-speech Synthesis (2022)
- Fine-grained Style Control In Transformer-based Text-to-speech Synthesis (2021)
- Learning Latent Representations For Style Control And Transfer In End-to-end Speech Synthesis (2018)
- Multi-speaker Multi-style Text-to-speech Synthesis With Single-speaker Single-style Training Data Scenarios (2021)
- Expressive Text-to-speech Using Style Tag (2021)