GenerTTS: Pronunciation Disentanglement For Timbre And Style Generalization In Cross-lingual Text-to-speech
2023 · Yahuan Cong, Haoyu Zhang, Haopeng Lin, et al.
Abstract
Cross-lingual timbre and style generalizable text-to-speech (TTS) aims to synthesize speech in a target language with a reference timbre or style that was never seen in that language during training. It faces two challenges: 1) timbre and pronunciation are correlated, because multilingual speech from a single speaker is usually hard to obtain; 2) style and pronunciation are entangled, because speech style contains both language-agnostic and language-specific parts. To address these challenges, we propose GenerTTS, which makes two main contributions: 1) a carefully designed HuBERT-based information bottleneck that disentangles timbre from pronunciation and style; 2) minimization of the mutual information between style and language, which discards language-specific information from the style embedding. Experiments indicate that GenerTTS outperforms baseline systems in style similarity and pronunciation accuracy, and enables cross-lingual timbre and style generalization.
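The abstract does not say which mutual-information estimator is used, so the following is only a toy illustration of the quantity being minimized, not the paper's method. It assumes style has been reduced to discrete cluster ids (a hypothetical simplification; the paper works with a continuous style embedding) and computes a plug-in estimate of I(style; language) from paired labels. The function name `mutual_information` and the sample data are made up for this sketch.

```python
import math
from collections import Counter


def mutual_information(pairs):
    """Plug-in estimate of I(S; L) in nats from (style_id, language_id) pairs.

    Toy assumption: style is a discrete cluster id, not a continuous embedding.
    """
    n = len(pairs)
    joint = Counter(pairs)                     # counts of (style, language)
    style = Counter(s for s, _ in pairs)       # marginal counts of style
    lang = Counter(l for _, l in pairs)        # marginal counts of language
    mi = 0.0
    for (s, l), c in joint.items():
        p_sl = c / n
        # p(s, l) / (p(s) * p(l)) written with raw counts
        mi += p_sl * math.log(p_sl * n * n / (style[s] * lang[l]))
    return mi


# Style ids that fully reveal the language: I(S; L) = log 2.
leaky = [(0, "zh"), (0, "zh"), (1, "en"), (1, "en")]
# Style ids that are independent of the language: I(S; L) = 0.
clean = [(0, "zh"), (1, "zh"), (0, "en"), (1, "en")]

print(mutual_information(leaky), mutual_information(clean))
```

In this toy setting, a style representation that still encodes the language behaves like `leaky`, while driving the estimate toward zero corresponds to the `clean` case, where knowing the style id tells you nothing about the language.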
Related papers
- Multi-speaker Multi-style Speech Synthesis With Timbre And Style Disentanglement (2022) · 6.77
- GenerSpeech: Towards Style Transfer For Generalizable Out-of-domain Text-to-speech (2022) · 5.24
- Generalized Multilingual Text-to-speech Generation With Language-aware Style Adaptation (2025) · 0.00
- StyleTTS: A Style-based Generative Model For Natural And Diverse Text-to-speech Synthesis (2022) · 10.97
- Exploring Timbre Disentanglement In Non-autoregressive Cross-lingual Text-to-speech (2021) · 2.26
- Multi-speaker Multi-style Text-to-speech Synthesis With Single-speaker Single-style Training Data Scenarios (2021) · 6.77
- Style Description Based Text-to-speech With Conditional Prosodic Layer Normalization Based Diffusion GAN (2023) · 0.00
- Exploring Synthetic Data For Cross-speaker Style Transfer In Style Representation Based TTS (2024) · 0.00