Combining Speakers Of Multiple Languages To Improve Quality Of Neural Voices
2021 · Javier Latorre, Charlotte Bailleul, Tuuli Morrill, et al.
Abstract
In this work, we explore multiple architectures and training procedures for developing a multi-speaker and multi-lingual neural TTS system with the goals of a) improving quality when the available data in the target language is limited and b) enabling cross-lingual synthesis. We report results from a large experiment using 30 speakers in 8 different languages across 15 different locales. The system is trained on the same amount of data per speaker. Compared to a single-speaker model, the proposed system, when fine-tuned to a speaker, produces significantly better quality in most cases while using less than \(40\%\) of the speaker's data used to build the single-speaker model. In cross-lingual synthesis, on average, the generated quality is within \(80\%\) of native single-speaker models, in terms of Mean Opinion Score.
Related papers
- Training Multi-speaker Neural Text-to-speech Systems Using Speaker-imbalanced Speech Corpora (2019)
- Efficient Neural Speech Synthesis For Low-resource Languages Through Multilingual Modeling (2020)
- Modeling Multi-speaker Latent Space To Improve Neural TTS: Quick Enrolling New Speaker And Enhancing Premium Voice (2018)
- Learning To Speak Fluently In A Foreign Language: Multilingual Speech Synthesis And Cross-language Voice Cloning (2019)
- Building A Mixed-lingual Neural TTS System With Only Monolingual Data (2019)
- Improving The Quality Of Neural TTS Using Long-form Content And Multi-speaker Multi-style Modeling (2022)
- Deep Voice 2: Multi-speaker Neural Text-to-speech (2017)
- Cross-lingual Text-to-speech Using Multi-task Learning And Speaker Classifier Joint Training (2022)