Multi-speaker Multi-style Speech Synthesis With Timbre And Style Disentanglement
2022 · Wei Song, Yanghao Yue, Ya-Jie Zhang, et al.
Abstract
Disentangling a speaker's timbre and style is very important for style transfer in multi-speaker multi-style text-to-speech (TTS) scenarios. With timbre and style disentangled, a TTS system can synthesize expressive speech for a given speaker in any style seen in the training corpus. However, current research on timbre and style disentanglement still has shortcomings. Existing methods either require single-speaker multi-style recordings, which are difficult and expensive to collect, or rely on complex networks and complicated training procedures, which are hard to reproduce and make the style transfer behavior hard to control. To improve the disentanglement of timbre and style, and to remove the reliance on a single-speaker multi-style corpus, this paper proposes a simple but effective timbre and style disentanglement method. The FastSpeech2 network is employed as the backbone network, with explicit duration, pitch, and ene
Related papers
- Multi-speaker Multi-style Text-to-speech Synthesis With Single-speaker Single-style Training Data Scenarios (2021) 6.77
- Fine-grained Style Modeling, Transfer And Prediction In Text-to-speech Synthesis Via Phone-level Content-style Disentanglement (2020) 9.41
- GenerTTS: Pronunciation Disentanglement For Timbre And Style Generalization In Cross-lingual Text-to-speech (2023) 2.26
- Speaker And Style Disentanglement Of Speech Based On Contrastive Predictive Coding Supported Factorized Variational Autoencoder (2024) 2.26
- Multi-speaker Expressive Speech Synthesis Via Multiple Factors Decoupling (2022) 0.00
- Text-driven Emotional Style Control And Cross-speaker Style Transfer In Neural TTS (2022) 7.81
- Boosting Multi-speaker Expressive Speech Synthesis With Semi-supervised Contrastive Learning (2023) 5.24
- STYLER: Style Factor Modeling With Rapidity And Robustness Via Speech Decomposition For Expressive And Controllable Neural Text To Speech (2021) 9.23