Improving Performance Of Seen And Unseen Speech Style Transfer In End-to-end Neural TTS
2021 · Xiaochun An, Frank K. Soong, Lei Xie
Abstract
End-to-end neural TTS training has shown improved performance in speech style transfer. However, the improvement is still limited by the training data available for the target styles and speakers. Style transfer performance is inadequate when the trained TTS tries to transfer speech to a target style from a new speaker with an unknown, arbitrary style. In this paper, we propose a new approach to style transfer for both seen and unseen styles, using disjoint, multi-style datasets, i.e., a separate dataset is recorded for each style, and each style is recorded by a single speaker with multiple utterances. To encode the style information, we adopt an inverse autoregressive flow (IAF) structure to improve the variational inference. The whole system is optimized to minimize a weighted sum of four different loss functions: 1) a reconstruction loss to measure the distortions in both source and target reconstructions; 2) an adversarial loss to "fool" a well-trained discriminator; 3) a style distortion loss t
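As an illustration of the inverse autoregressive flow (IAF) mentioned in the abstract, the following is a minimal NumPy sketch of a single generic IAF step in its gated form. It is not the authors' style encoder: the function name `iaf_step`, the use of a single masked linear layer as the autoregressive network, and all parameter names are assumptions made for this example.

```python
import numpy as np


def iaf_step(z, w_mu, w_s, b_mu, b_s):
    """Apply one gated IAF step to a batch of latents z of shape (batch, d).

    Hypothetical sketch: the autoregressive network is a single linear layer
    whose weights are masked to be strictly lower triangular, so output i
    depends only on z[:, :i].
    """
    mask = np.tril(np.ones_like(w_mu), k=-1)  # zero out the diagonal and above
    m = z @ (w_mu * mask).T + b_mu            # shift term
    s = z @ (w_s * mask).T + b_s              # pre-activation of the gate
    sigma = 1.0 / (1.0 + np.exp(-s))          # gate in (0, 1)
    z_new = sigma * z + (1.0 - sigma) * m     # gated update of the latent
    # The Jacobian dz_new/dz is triangular with sigma on its diagonal,
    # so its log-determinant is the sum of log(sigma) over dimensions.
    log_det = np.sum(np.log(sigma), axis=-1)
    return z_new, log_det


rng = np.random.default_rng(0)
d = 4
z0 = rng.normal(size=(2, d))
params = (rng.normal(size=(d, d)), rng.normal(size=(d, d)),
          rng.normal(size=d), rng.normal(size=d))
z1, log_det = iaf_step(z0, *params)
```

In variational inference, the density of the transformed sample follows from the change of variables, log q(z1) = log q(z0) - log_det, which is what lets a chain of such steps yield a more flexible posterior than a diagonal Gaussian while keeping the density tractable.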
Related papers
- Text-driven Emotional Style Control And Cross-speaker Style Transfer In Neural TTS (2022) 7.81
- Improving The Quality Of Neural TTS Using Long-form Content And Multi-speaker Multi-style Modeling (2022) 3.58
- Multi-reference Neural TTS Stylization With Adversarial Cycle Consistency (2019) 9.03
- Cross-speaker Style Transfer With Prosody Bottleneck In Neural Speech Synthesis (2021) 10.21
- Exploring Synthetic Data For Cross-speaker Style Transfer In Style Representation Based TTS (2024) 0.00
- Interpretable Style Transfer For Text-to-speech With Controlvae And Diffusion Bridge (2023) 5.24
- Learning Latent Representations For Style Control And Transfer In End-to-end Speech Synthesis (2018) 0.00
- Improving Data Augmentation-based Cross-speaker Style Transfer For TTS With Singing Voice, Style Filtering, And F0 Matching (2024) 0.00