Msdtron: A High-capability Multi-speaker Speech Synthesis System For Diverse Data Using Characteristic Information
2021 · Qinghua Wu, Quanbo Shen, Jian Luan, et al.
Abstract
In multi-speaker speech synthesis, data from different speakers tend to be highly diverse because speakers can differ greatly in age, speaking style, emotion, and so on. Improving modeling capability for multi-speaker speech synthesis is therefore important but challenging. To address this, the paper proposes a high-capability speech synthesis system called Msdtron, in which 1) a representation of the harmonic structure of speech, called the excitation spectrogram, is designed to directly guide the learning of harmonics in the mel-spectrogram, and 2) a conditional gated LSTM (CGLSTM) is proposed to control the flow of text content information through the network by re-weighting the gates of the LSTM using speaker information. Experiments show a significant reduction in mel-spectrogram reconstruction error when training the multi-speaker model, and a large improvement in the subjective evaluation of the speaker-adapted model.
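The abstract does not give the CGLSTM equations, so the following is only a minimal sketch of one plausible reading of "re-weighting the gates of LSTM using speaker information". It assumes that each of the input, forget, and output gates is multiplied elementwise by a sigmoid-activated linear projection of a speaker embedding. The names `init_params`, `cglstm_step`, `V`, and `c` are hypothetical and are not taken from the paper; the actual Msdtron formulation may differ.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_params(input_dim, hidden_dim, speaker_dim, seed=0):
    """Randomly initialise the weights of one cell (illustrative values only)."""
    rng = np.random.default_rng(seed)
    scale = 0.1
    return {
        # Standard LSTM weights for the stacked gates [i, f, o, g].
        "W": rng.normal(0.0, scale, (4 * hidden_dim, input_dim + hidden_dim)),
        "b": np.zeros(4 * hidden_dim),
        # Assumed speaker projection producing one weight per gate unit.
        "V": rng.normal(0.0, scale, (3 * hidden_dim, speaker_dim)),
        "c": np.zeros(3 * hidden_dim),
    }


def cglstm_step(x, h_prev, c_prev, speaker, params):
    """One time step of a speaker-conditioned LSTM cell (assumed form)."""
    z = params["W"] @ np.concatenate([x, h_prev]) + params["b"]
    i, f, o, g = np.split(z, 4)
    i, f, o, g = sigmoid(i), sigmoid(f), sigmoid(o), np.tanh(g)

    # Speaker-dependent weights in (0, 1), one set per gate.
    r_i, r_f, r_o = np.split(sigmoid(params["V"] @ speaker + params["c"]), 3)

    # Re-weight the gates so the speaker controls how much content flows.
    i, f, o = i * r_i, f * r_f, o * r_o

    c_new = f * c_prev + i * g
    h_new = o * np.tanh(c_new)
    return h_new, c_new
```

Under this assumption, the same text input `x` produces different hidden states for different speaker embeddings, because the speaker vector scales every gate before the cell and hidden states are updated.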
Related papers
- Multi-speaker Multi-style Text-to-speech Synthesis With Single-speaker Single-style Training Data Scenarios (2021)
- Speaker Verification-derived Loss And Data Augmentation For Dnn-based Multispeaker Speech Synthesis (2021)
- Multi-spectrogan: High-diversity And High-fidelity Spectrogram Generation With Adversarial Style Combination For Speech Synthesis (2020)
- Property-aware Multi-speaker Data Simulation: A Probabilistic Modelling Technique For Synthetic Data Generation (2023)
- Training Multi-speaker Neural Text-to-speech Systems Using Speaker-imbalanced Speech Corpora (2019)
- Modeling Multi-speaker Latent Space To Improve Neural TTS: Quick Enrolling New Speaker And Enhancing Premium Voice (2018)
- ELF: Encoding Speaker-specific Latent Speech Feature For Speech Synthesis (2023)
- Improving The Quality Of Neural TTS Using Long-form Content And Multi-speaker Multi-style Modeling (2022)