Non-autoregressive Neural Text-to-speech
2019 · Kainan Peng, Wei Ping, Zhao Song, et al.
Abstract
In this work, we propose ParaNet, a non-autoregressive seq2seq model that converts text to a spectrogram. It is fully convolutional and achieves a 46.7-fold speed-up over the lightweight Deep Voice 3 at synthesis, while obtaining reasonably good speech quality. ParaNet also produces stable alignment between text and speech on challenging test sentences by iteratively improving the attention in a layer-by-layer manner. Furthermore, we build a parallel text-to-speech system and test various parallel neural vocoders, which can synthesize speech from text through a single feed-forward pass. We also explore a novel VAE-based approach to train the inverse autoregressive flow (IAF) based parallel vocoder from scratch, which avoids the need for distillation from a separately trained WaveNet, as in previous work.
Related papers
- FastSpeech: Fast, Robust And Controllable Text To Speech (2019)
- Parallel WaveNet Conditioned On VAE Latent Vectors (2020)
- Parallel Tacotron: Non-autoregressive And Controllable TTS (2020)
- PortaSpeech: Portable And High-quality Generative Text-to-speech (2021)
- GELP: GAN-excited Linear Prediction For Speech Synthesis From Mel-spectrogram (2019)
- VAENAR-TTS: Variational Auto-encoder Based Non-autoregressive Text-to-speech Synthesis (2021)
- Paraformer: Fast And Accurate Parallel Transformer For Non-autoregressive End-to-end Speech Recognition (2022)
- TalkNet 2: Non-autoregressive Depth-wise Separable Convolutional Model For Speech Synthesis With Explicit Pitch And Duration Prediction (2021)