Effective Parameter Estimation Methods For An ExcitNet Model In Generative Text-to-speech Systems
2019 · Ohsung Kwon, Eunwoo Song, Jae-Min Kim, et al.
Abstract
In this paper, we propose a high-quality generative text-to-speech (TTS) system using an effective spectrum and excitation estimation method. Our previous research verified the effectiveness of the ExcitNet-based speech generation model in a parametric TTS framework. However, the challenge remains to build a high-quality speech synthesis system because auxiliary conditional features estimated by a simple deep neural network often contain large prediction errors, and the errors are inevitably propagated throughout the autoregressive generation process of the ExcitNet vocoder. To generate more natural speech signals, we exploited a sequence-to-sequence (seq2seq) acoustic model with an attention-based generative network (e.g., Tacotron 2) to estimate the condition parameters of the ExcitNet vocoder. Because the seq2seq acoustic model accurately estimates spectral parameters, and because the ExcitNet model effectively generates the corresponding time-domain excitation signals, combining th
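The split the abstract describes, spectral parameters from the acoustic model and a time-domain excitation signal from ExcitNet, can be illustrated with a toy source-filter round trip. This is a minimal sketch, not the authors' implementation: it assumes the two parts are combined by a linear-prediction (LP) synthesis filter, which the abstract shown here does not state. The function names `lp_residual` and `lp_synthesis` and all numeric values are hypothetical.

```python
def lp_residual(signal, coeffs):
    """Inverse-filter a signal: e[n] = s[n] - sum_k a_k * s[n-k]."""
    residual = []
    for n, sample in enumerate(signal):
        # Prediction from past samples; samples before n = 0 are treated as zero.
        predicted = sum(
            a * signal[n - k]
            for k, a in enumerate(coeffs, start=1)
            if n - k >= 0
        )
        residual.append(sample - predicted)
    return residual


def lp_synthesis(excitation, coeffs):
    """All-pole synthesis: s[n] = e[n] + sum_k a_k * s[n-k]."""
    speech = []
    for n, e in enumerate(excitation):
        # Each output sample depends on previously generated output samples.
        predicted = sum(
            a * speech[n - k]
            for k, a in enumerate(coeffs, start=1)
            if n - k >= 0
        )
        speech.append(e + predicted)
    return speech


# Hypothetical toy values, not taken from the paper.
coeffs = [0.5, -0.25]                      # stands in for estimated spectral parameters
signal = [1.0, 0.5, -0.25, 0.75, 0.0, -1.0]

excitation = lp_residual(signal, coeffs)   # what an excitation model would have to generate
reconstructed = lp_synthesis(excitation, coeffs)
```

In this sketch `reconstructed` matches `signal` because the same coefficients are used for analysis and synthesis. If the coefficients used at synthesis time differ from the true ones, the mismatch feeds back through the recursive filter, which is one way to see why accurate conditioning parameters matter.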
Related papers
- ExcitNet Vocoder: A Neural Excitation Model For Parametric Speech Synthesis Systems (2018) · 9.76
- Towards Parametric Speech Synthesis Using Gaussian-Markov Model Of Spectral Envelope And Wavelet-based Decomposition Of F0 (2022) · 0.00
- Generative Adversarial Network-based Glottal Waveform Model For Statistical Parametric Speech Synthesis (2019) · 10.35
- Waveform Generation For Text-to-speech Synthesis Using Pitch-synchronous Multi-scale Generative Adversarial Networks (2018) · 8.35
- Non-autoregressive TTS With Explicit Duration Modelling For Low-resource Highly Expressive Speech (2021) · 8.82
- EfficientTTS: An Efficient And High-quality Text-to-speech Architecture (2020) · 0.00
- Conditional Variational Autoencoder With Adversarial Learning For End-to-end Text-to-speech (2021) · 0.00
- Vocal Effort Modeling In Neural TTS For Improving The Intelligibility Of Synthetic Speech In Noise (2022) · 4.52