Adversarial Learning Of Intermediate Acoustic Feature For End-to-end Lightweight Text-to-speech
2022 · Hyungchan Yoon, Seyun Um, Changwhan Kim, et al.
Abstract
To simplify the generation process, several text-to-speech (TTS) systems implicitly learn intermediate latent representations instead of relying on predefined features (e.g., mel-spectrograms). However, their generation quality is unsatisfactory because these representations lack speech variances. In this paper, we improve TTS performance by adding *prosody embeddings* to the latent representations. During training, we extract reference prosody embeddings from mel-spectrograms; during inference, we estimate these embeddings from text using generative adversarial networks (GANs). GANs let us estimate the prosody embeddings quickly and reliably, even though their distributions are complex owing to the dynamic nature of speech. We also show that the prosody embeddings work as efficient features for learning a robust alignment between text and acoustic features. In comparative experiments, our proposed model surpasses several publicly available models while using fewer parameters and lower computational complexity.
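As a rough illustration of the training/inference split described in the abstract, the sketch below adds a prosody embedding to a latent text representation, taking the embedding from reference mel frames when they are available and from a text-conditioned generator otherwise. It is a toy under stated assumptions, not the authors' implementation: the function names, the embedding size `DIM`, the averaging "reference encoder", the noise-based "generator", and the least-squares GAN loss are all hypothetical stand-ins, since the abstract does not specify the architecture or the loss.

```python
import random

DIM = 4  # hypothetical embedding size, chosen only for illustration


def reference_prosody(mel_frames_per_token):
    """Training path: summarise the mel frames aligned to each token.

    Stand-in for a learned reference encoder: it averages the first
    DIM values of the frames that belong to each token.
    """
    out = []
    for frames in mel_frames_per_token:
        n = len(frames)
        out.append([sum(f[d] for f in frames) / n for d in range(DIM)])
    return out


def generated_prosody(text_hidden, rng):
    """Inference path: stand-in for a GAN generator conditioned on text.

    A real generator would be a trained network; this toy version only
    perturbs the text representation with Gaussian noise.
    """
    return [[h + rng.gauss(0.0, 0.1) for h in vec] for vec in text_hidden]


def add_prosody(text_hidden, mel_frames_per_token=None, rng=None):
    """Add prosody embeddings to the latent text representation."""
    if mel_frames_per_token is not None:   # training: reference is available
        prosody = reference_prosody(mel_frames_per_token)
    else:                                  # inference: estimate from text
        prosody = generated_prosody(text_hidden, rng or random.Random(0))
    return [[h + p for h, p in zip(hv, pv)]
            for hv, pv in zip(text_hidden, prosody)]


def lsgan_losses(d_real, d_fake):
    """Least-squares GAN objectives (an assumed choice of loss).

    d_real / d_fake are discriminator scores for reference and
    generated prosody embeddings, respectively.
    """
    d_loss = (sum((x - 1.0) ** 2 for x in d_real) / len(d_real)
              + sum(x ** 2 for x in d_fake) / len(d_fake))
    g_loss = sum((x - 1.0) ** 2 for x in d_fake) / len(d_fake)
    return d_loss, g_loss
```

Calling `add_prosody(text_hidden, mel_frames_per_token)` mimics a training step, while `add_prosody(text_hidden)` mimics inference, where no mel-spectrogram exists and the embedding has to be estimated from text alone.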
Related papers
- End-to-end Adversarial Text-to-speech (2020)
- Dynamic Prosody Generation For Speech Synthesis Using Linguistics-driven Acoustic Embedding Selection (2019)
- High Fidelity Speech Synthesis With Adversarial Networks (2019)
- Expediting TTS Synthesis With Adversarial Vocoding (2019)
- Conditional Variational Autoencoder With Adversarial Learning For End-to-end Text-to-speech (2021)
- Multi-spectrogan: High-diversity And High-fidelity Spectrogram Generation With Adversarial Style Combination For Speech Synthesis (2020)
- Fine-grained Robust Prosody Transfer For Single-speaker Neural Text-to-speech (2019)
- Generative Adversarial Training For Text-to-speech Synthesis Based On Raw Phonetic Input And Explicit Prosody Modelling (2023)