Fine-grained Style Modeling, Transfer And Prediction In Text-to-speech Synthesis Via Phone-level Content-style Disentanglement
2020 · Daxin Tan, Tan Lee
Abstract
This paper presents a novel neural network design for fine-grained style modeling, transfer, and prediction in expressive text-to-speech (TTS) synthesis. Fine-grained modeling is realized by extracting style embeddings from the mel-spectrograms of phone-level speech segments. Collaborative learning and adversarial learning strategies are applied to disentangle the content and style factors in speech effectively and to alleviate the "content leakage" problem in style modeling. The proposed system can be used for varying-content speech style transfer in the single-speaker scenario. Objective and subjective evaluations show that the system performs better than other fine-grained speech style transfer models, especially in content preservation. By incorporating a style predictor, the proposed system can also be used for text-to-speech synthesis. Audio samples are provided for system demonstration at https://daxintan-cuhk.github.io/pl-csd-spe
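As a rough illustration of what "phone-level" style extraction means, the sketch below cuts a mel-spectrogram into per-phone segments and produces one vector per phone. It is not the authors' model: the function names are hypothetical, the phone boundaries are assumed to come from some external alignment step, and simple mean pooling over time stands in for the learned style encoder described in the abstract.

```python
from typing import List, Sequence, Tuple

Frame = Sequence[float]  # one mel-spectrogram frame (n_mels values)


def phone_level_segments(
    mel: Sequence[Frame], boundaries: Sequence[Tuple[int, int]]
) -> List[List[Frame]]:
    """Cut a mel-spectrogram into per-phone segments.

    `boundaries` holds (start_frame, end_frame) pairs, end exclusive.
    Assumption: these come from an external phone-level alignment.
    """
    segments = []
    for start, end in boundaries:
        if not 0 <= start < end <= len(mel):
            raise ValueError(f"bad phone boundary: ({start}, {end})")
        segments.append(list(mel[start:end]))
    return segments


def mean_pool(segment: Sequence[Frame]) -> List[float]:
    """Placeholder for a learned style encoder: average frames over time."""
    n_frames = len(segment)
    n_mels = len(segment[0])
    return [sum(frame[k] for frame in segment) / n_frames for k in range(n_mels)]


def phone_level_style_embeddings(
    mel: Sequence[Frame], boundaries: Sequence[Tuple[int, int]]
) -> List[List[float]]:
    """Return one fixed-size vector per phone segment."""
    return [mean_pool(seg) for seg in phone_level_segments(mel, boundaries)]


if __name__ == "__main__":
    # Toy input: 4 frames, 2 mel bins, two phones of 2 frames each.
    toy_mel = [[0.0, 2.0], [2.0, 4.0], [4.0, 0.0], [6.0, 2.0]]
    print(phone_level_style_embeddings(toy_mel, [(0, 2), (2, 4)]))
```

The point of the sketch is only the granularity: each phone gets its own style vector, so the sequence of vectors has the same length as the phone sequence rather than one vector per utterance.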
Related papers
- Fine-grained Style Control In Transformer-based Text-to-speech Synthesis (2021) 11.19
- Text-driven Emotional Style Control And Cross-speaker Style Transfer In Neural TTS (2022) 7.81
- StyleTTS: A Style-based Generative Model For Natural And Diverse Text-to-speech Synthesis (2022) 10.97
- STYLER: Style Factor Modeling With Rapidity And Robustness Via Speech Decomposition For Expressive And Controllable Neural Text To Speech (2021) 9.23
- Multi-speaker Multi-style Speech Synthesis With Timbre And Style Disentanglement (2022) 6.77
- Improving Prosody For Cross-speaker Style Transfer By Semi-supervised Style Extractor And Hierarchical Modeling In Speech Synthesis (2023) 7.50
- Self-supervised Context-aware Style Representation For Expressive Speech Synthesis (2022) 6.34
- Stylebook: Content-dependent Speaking Style Modeling For Any-to-any Voice Conversion Using Only Speech Data (2023) 0.00