Modeling Multi-speaker Latent Space To Improve Neural TTS: Quick Enrolling New Speaker And Enhancing Premium Voice
2018 Β· Yan Deng, Lei He, Frank Soong
Abstract
Neural TTS has shown it can generate high quality synthesized speech. In this paper, we investigate the multi-speaker latent space to improve neural TTS for adapting the system to new speakers with only several minutes of speech or enhancing a premium voice by utilizing the data from other speakers for richer contextual coverage and better generalization. A multi-speaker neural TTS model is built with the embedded speaker information in both spectral and speaker latent space. The experimental results show that, with less than 5 minutes of training data from a new speaker, the new model can achieve an MOS score of 4.16 in naturalness and 4.64 in speaker similarity close to human recordings (4.74). For a well-trained premium voice, we can achieve an MOS score of 4.5 for out-of-domain texts, which is comparable to an MOS of 4.58 for professional recordings, and significantly outperforms single speaker result of 4.28.
Authors
(none)
Tags
Stats
Related papers
- Improving The Quality Of Neural TTS Using Long-form Content And Multi-speaker Multi-style Modeling (2022)3.58
- Combining Speakers Of Multiple Languages To Improve Quality Of Neural Voices (2021)5.24
- Training Multi-speaker Neural Text-to-speech Systems Using Speaker-imbalanced Speech Corpora (2019)8.09
- Efficient Neural Speech Synthesis For Low-resource Languages Through Multilingual Modeling (2020)8.60
- Non-autoregressive TTS With Explicit Duration Modelling For Low-resource Highly Expressive Speech (2021)8.82
- Deep Voice 2: Multi-speaker Neural Text-to-speech (2017)0.00
- Ganspeech: Adversarial Training For High-fidelity Multi-speaker Speech Synthesis (2021)10.07
- Adapting TTS Models For New Speakers Using Transfer Learning (2021)0.00