Fine-grained Style Control In Transformer-based Text-to-speech Synthesis
2021 Β· Li-Wei Chen, Alexander Rudnicky
Abstract
In this paper, we present a novel architecture to realize fine-grained style control on the transformer-based text-to-speech synthesis (TransformerTTS). Specifically, we model the speaking style by extracting a time sequence of local style tokens (LST) from the reference speech. The existing content encoder in TransformerTTS is then replaced by our designed cross-attention blocks for fusion and alignment between content and style. As the fusion is performed along with the skip connection, our cross-attention block provides a good inductive bias to gradually infuse the phoneme representation with a given style. Additionally, we prevent the style embedding from encoding linguistic content by randomly truncating LST during training and using wav2vec 2.0 features. Experiments show that with fine-grained style control, our system performs better in terms of naturalness, intelligibility, and style transferability. Our code and samples are publicly available.
Authors
(none)
Tags
Stats
Related papers
- Fine-grained Style Modeling, Transfer And Prediction In Text-to-speech Synthesis Via Phone-level Content-style Disentanglement (2020)9.41
- Text-driven Emotional Style Control And Cross-speaker Style Transfer In Neural TTS (2022)7.81
- Styletts: A Style-based Generative Model For Natural And Diverse Text-to-speech Synthesis (2022)10.97
- Interpretable Style Transfer For Text-to-speech With Controlvae And Diffusion Bridge (2023)5.24
- Stylespeech: Parameter-efficient Fine Tuning For Pre-trained Controllable Text-to-speech (2024)6.34
- Expressive Text-to-speech Using Style Tag (2021)10.85
- Stylefusion TTS: Multimodal Style-control And Enhanced Feature Fusion For Zero-shot Text-to-speech Synthesis (2024)6.34
- Style Tokens: Unsupervised Style Modeling, Control And Transfer In End-to-end Speech Synthesis (2018)0.00