Multi-speaker Expressive Speech Synthesis Via Multiple Factors Decoupling
2022 · Xinfa Zhu, Yi Lei, Kun Song, et al.
Abstract
This paper aims to synthesize the target speaker's speech with a desired speaking style and emotion by transferring the style and emotion from reference speech recorded by other speakers. We address this challenging problem with a two-stage framework composed of a text-to-style-and-emotion (Text2SE) module and a style-and-emotion-to-wave (SE2Wave) module, bridged by neural bottleneck (BN) features. To further solve the multi-factor (speaker timbre, speaking style, and emotion) decoupling problem, we adopt the multi-label binary vector (MBV) to discretize the extracted embeddings and mutual information (MI) minimization to disentangle these highly entangled factors in both the Text2SE and SE2Wave modules. Moreover, we introduce a semi-supervised training strategy to leverage data from multiple speakers, including emotion-labeled data, style-labeled data, and unlabeled data. To better transfer the fine-grained expression from references to the target speaker in non-parallel trans
Related papers
- Boosting Multi-speaker Expressive Speech Synthesis With Semi-supervised Contrastive Learning (2023) · 5.24
- MsEmoTTS: Multi-scale Emotion Transfer, Prediction, And Control For Emotional Speech Synthesis (2022) · 13.97
- Text-driven Emotional Style Control And Cross-speaker Style Transfer In Neural TTS (2022) · 7.81
- Multi-speaker Multi-style Text-to-speech Synthesis With Single-speaker Single-style Training Data Scenarios (2021) · 6.77
- Cross-speaker Emotion Disentangling And Transfer For End-to-end Speech Synthesis (2021) · 12.61
- Cross-speaker Emotion Transfer Based On Speaker Condition Layer Normalization And Semi-supervised Training In Text-to-speech (2021) · 0.00
- Multi-speaker Multi-style Speech Synthesis With Timbre And Style Disentanglement (2022) · 6.77
- Speaker And Style Disentanglement Of Speech Based On Contrastive Predictive Coding Supported Factorized Variational Autoencoder (2024) · 2.26