Cross-lingual Text-to-speech Using Multi-task Learning And Speaker Classifier Joint Training
2022 Β· J. Yang, Lei He
Abstract
In cross-lingual speech synthesis, the speech in various languages can be synthesized for a monoglot speaker. Normally, only the data of monoglot speakers are available for model training, thus the speaker similarity is relatively low between the synthesized cross-lingual speech and the native language recordings. Based on the multilingual transformer text-to-speech model, this paper studies a multi-task learning framework to improve the cross-lingual speaker similarity. To further improve the speaker similarity, joint training with a speaker classifier is proposed. Here, a scheme similar to parallel scheduled sampling is proposed to train the transformer model efficiently to avoid breaking the parallel training mechanism when introducing joint training. By using multi-task learning and speaker classifier joint training, in subjective and objective evaluations, the cross-lingual speaker similarity can be consistently improved for both the seen and unseen speakers in the training set.
Authors
(none)
Tags
Stats
Related papers
- Learning To Speak Fluently In A Foreign Language: Multilingual Speech Synthesis And Cross-language Voice Cloning (2019)15.03
- Combining Speakers Of Multiple Languages To Improve Quality Of Neural Voices (2021)5.24
- Cross-lingual Multi-speaker Text-to-speech Synthesis For Voice Cloning Without Using Parallel Corpus For Unseen Speakers (2019)0.00
- The THU-HCSI Multi-speaker Multi-lingual Few-shot Voice Cloning System For LIMMITS'24 Challenge (2024)0.00
- ERNIE-SAT: Speech And Text Joint Pretraining For Cross-lingual Multi-speaker Text-to-speech (2022)0.00
- Using Joint Training Speaker Encoder With Consistency Loss To Achieve Cross-lingual Voice Conversion And Expressive Voice Conversion (2023)0.00
- Multi-speaker Multi-style Text-to-speech Synthesis With Single-speaker Single-style Training Data Scenarios (2021)6.77
- Training Multi-speaker Neural Text-to-speech Systems Using Speaker-imbalanced Speech Corpora (2019)8.09