Improving The Quality Of Neural TTS Using Long-form Content And Multi-speaker Multi-style Modeling
2022 · Tuomo Raitio, Javier Latorre, Andrea Davis, et al.
Abstract
Neural text-to-speech (TTS) can provide quality close to natural speech if an adequate amount of high-quality speech material is available for training. However, acquiring speech data for TTS training is costly and time-consuming, especially if the goal is to generate different speaking styles. In this work, we show that we can transfer speaking style across speakers and improve the quality of synthetic speech by training a multi-speaker multi-style (MSMS) model with long-form recordings, in addition to regular TTS recordings. In particular, we show that 1) multi-speaker modeling improves the overall TTS quality, 2) the proposed MSMS approach outperforms the pre-training and fine-tuning approach when utilizing additional multi-speaker data, and 3) the long-form speaking style is highly rated regardless of the target text domain.
Related papers
- Modeling Multi-speaker Latent Space To Improve Neural TTS: Quick Enrolling New Speaker And Enhancing Premium Voice (2018)
- Text-driven Emotional Style Control And Cross-speaker Style Transfer In Neural TTS (2022)
- StyleTTS: A Style-based Generative Model For Natural And Diverse Text-to-speech Synthesis (2022)
- Improving Performance Of Seen And Unseen Speech Style Transfer In End-to-end Neural TTS (2021)
- Training Multi-speaker Neural Text-to-speech Systems Using Speaker-imbalanced Speech Corpora (2019)
- Combining Speakers Of Multiple Languages To Improve Quality Of Neural Voices (2021)
- Efficient Neural Speech Synthesis For Low-resource Languages Through Multilingual Modeling (2020)
- Low-resource Expressive Text-to-speech Using Data Augmentation (2020)