JETS: Jointly Training FastSpeech2 And HiFi-GAN For End To End Text To Speech
2022 · Dan Lim, Sunghee Jung, Eesung Kim
Abstract
In neural text-to-speech (TTS), two-stage systems, or cascades of separately learned models, have shown synthesis quality close to human speech. For example, FastSpeech2 transforms an input text into a mel-spectrogram, and HiFi-GAN then generates a raw waveform from that mel-spectrogram; the two are called an acoustic feature generator and a neural vocoder, respectively. However, their training pipeline is somewhat cumbersome in that it requires fine-tuning and an accurate speech-text alignment for optimal performance. In this work, we present an end-to-end text-to-speech (E2E-TTS) model that has a simplified training pipeline and outperforms a cascade of separately learned models. Specifically, our proposed model jointly trains FastSpeech2 and HiFi-GAN with an alignment module. Since there is no acoustic feature mismatch between training and inference, it does not require fine-tuning. Furthermore, we remove the dependency on an external speech-text alignment tool by adopting an alignment lea
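The abstract says an accurate speech-text alignment is needed but does not show how it is used. As a rough illustration, FastSpeech-style models typically turn the alignment into one integer duration per input token and repeat each token's hidden vector that many times to reach frame-level length. The sketch below is a minimal pure-Python version of that expansion step, not the paper's implementation; the function name `length_regulate` and the toy values are assumptions made for illustration.

```python
from typing import List, Sequence


def length_regulate(token_states: Sequence[Sequence[float]],
                    durations: Sequence[int]) -> List[List[float]]:
    """Repeat each token-level vector durations[i] times (frame-level output)."""
    if len(token_states) != len(durations):
        raise ValueError("need exactly one duration per token")
    if any(d < 0 for d in durations):
        raise ValueError("durations must be non-negative")
    frames: List[List[float]] = []
    for state, d in zip(token_states, durations):
        # Copy the vector so the output frames do not alias each other.
        frames.extend(list(state) for _ in range(d))
    return frames


# Hypothetical toy input: 3 tokens with 2-dimensional hidden states.
token_states = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
# Frames per token, as an alignment (external or learned) would supply them.
durations = [2, 1, 3]

frames = length_regulate(token_states, durations)
print(len(frames))  # 6, the sum of the durations
```

In a cascade the durations would come from a separate alignment tool run before training, which is the dependency the abstract says the proposed model removes.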
Related papers
- FastSpeech: Fast, Robust And Controllable Text To Speech (2019) · 0.00
- FastSpeech 2: Fast And High-quality End-to-end Text To Speech (2020) · 0.00
- Fast And Small Footprint Hybrid HMM-HiFiGAN Based System For Speech Synthesis In Indian Languages (2023) · 0.00
- FPETS: Fully Parallel End-to-end Text-to-speech System (2018) · 4.52
- EfficientTTS: An Efficient And High-quality Text-to-speech Architecture (2020) · 0.00
- Reinforce-Aligner: Reinforcement Alignment Search For Robust End-to-end Text-to-speech (2021) · 8.09
- DelightfulTTS 2: End-to-end Speech Synthesis With Adversarial Vector-quantized Auto-encoders (2022) · 9.23
- End-to-end Adversarial Text-to-speech (2020) · 0.00