Using Multiple Reference Audios And Style Embedding Constraints For Speech Synthesis
2021 · Cheng Gong, Longbiao Wang, Zhenhua Ling, et al.
Abstract
An end-to-end speech synthesis model can take an utterance directly as reference audio and generate speech from text with prosody and speaker characteristics similar to that reference. However, an appropriate acoustic embedding must be manually selected during inference. Because only matched text and speech are used in training, using unmatched text and speech at inference causes the model to synthesize speech with low content quality. In this study, we propose to mitigate these two problems by using multiple reference audios and style embedding constraints rather than only the target audio. The multiple reference audios are selected automatically using sentence similarity determined by Bidirectional Encoder Representations from Transformers (BERT). In addition, we use the "target" style embedding from a pre-trained encoder as a constraint by considering the mutual information between the predicted and "target" style embeddings. The ex
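The automatic reference selection described in the abstract can be illustrated with a minimal sketch. It assumes each training utterance already has a sentence vector (in the paper these would come from BERT; here they are hand-written toy vectors), and it ranks candidates by cosine similarity to the input text's vector. The function names `cosine` and `select_references`, the value of `k`, and the utterance IDs are hypothetical and not taken from the paper, which may compute or combine similarities differently.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_references(query_vec, candidates, k=3):
    """Return the IDs of the k candidates most similar to query_vec.

    candidates: list of (utterance_id, sentence_vec) pairs. The vectors
    stand in for BERT sentence embeddings (an assumption of this sketch).
    """
    scored = [(cosine(query_vec, vec), utt_id) for utt_id, vec in candidates]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [utt_id for _, utt_id in scored[:k]]


# Toy stand-ins for sentence embeddings of training utterances.
train = [
    ("utt_a", [1.0, 0.0, 0.0]),
    ("utt_b", [0.9, 0.1, 0.0]),
    ("utt_c", [0.0, 1.0, 0.0]),
    ("utt_d", [0.0, 0.0, 1.0]),
]
query = [1.0, 0.05, 0.0]  # embedding of the text to synthesize

refs = select_references(query, train, k=2)
print(refs)
```

The audio paired with each selected utterance would then serve as one of the multiple references, so no acoustic embedding has to be chosen by hand at inference time.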
Related papers
- Self-supervised Context-aware Style Representation For Expressive Speech Synthesis (2022) · 6.34
- Multi-speaker Expressive Speech Synthesis Via Multiple Factors Decoupling (2022) · 0.00
- Style-preserving Lip Sync Via Audio-aware Style Reference (2024) · 0.00
- Cross-speaker Style Transfer With Prosody Bottleneck In Neural Speech Synthesis (2021) · 10.21
- MSStyleTTS: Multi-scale Style Modeling With Hierarchical Context Information For Expressive Speech Synthesis (2023) · 6.77
- StyleTTS: A Style-based Generative Model For Natural And Diverse Text-to-speech Synthesis (2022) · 10.97
- Multi-speaker Multi-style Text-to-speech Synthesis With Single-speaker Single-style Training Data Scenarios (2021) · 6.77
- Improving Performance Of Seen And Unseen Speech Style Transfer In End-to-end Neural TTS (2021) · 6.34