Referee: Towards Reference-free Cross-speaker Style Transfer With Low-quality Data For Expressive Speech Synthesis
2021 · Songxiang Liu, Shan Yang, Dan Su, et al.
Abstract
Cross-speaker style transfer (CSST) in text-to-speech (TTS) synthesis aims at transferring a speaking style to the synthesised speech in a target speaker's voice. Most previous CSST approaches rely on expensive high-quality data carrying desired speaking style during training and require a reference utterance to obtain speaking style descriptors as conditioning on the generation of a new sentence. This work presents Referee, a robust reference-free CSST approach for expressive TTS, which fully leverages low-quality data to learn speaking styles from text. Referee is built by cascading a text-to-style (T2S) model with a style-to-wave (S2W) model. Phonetic PosteriorGram (PPG), phoneme-level pitch and energy contours are adopted as fine-grained speaking style descriptors, which are predicted from text using the T2S model. A novel pretrain-refinement method is adopted to learn a robust T2S model by only using readily accessible low-quality data. The S2W model is trained with high-quality t
Related papers
- Styles2st: Zero-shot Style Transfer For Direct Speech-to-speech Translation (2023)
- Exploring Synthetic Data For Cross-speaker Style Transfer In Style Representation Based TTS (2024)
- Speech-to-speech Translation With Discrete-unit-based Style Transfer (2023)
- Self-supervised Context-aware Style Representation For Expressive Speech Synthesis (2022)
- Style-label-free: Cross-speaker Style Transfer By Quantized VAE And Speaker-wise Normalization In Speech Synthesis (2022)
- CALM: Contrastive Cross-modal Speaking Style Modeling For Expressive Text-to-speech Synthesis (2023)
- Improving Data Augmentation-based Cross-speaker Style Transfer For TTS With Singing Voice, Style Filtering, And F0 Matching (2024)
- Improving Prosody For Cross-speaker Style Transfer By Semi-supervised Style Extractor And Hierarchical Modeling In Speech Synthesis (2023)