Wave-Tacotron: Spectrogram-free End-to-end Text-to-speech Synthesis
2020 · Ron J. Weiss, RJ Skerry-Ryan, Eric Battenberg, et al.
Abstract
We describe a sequence-to-sequence neural network which directly generates speech waveforms from text inputs. The architecture extends the Tacotron model by incorporating a normalizing flow into the autoregressive decoder loop. Output waveforms are modeled as a sequence of non-overlapping fixed-length blocks, each one containing hundreds of samples. The interdependencies of waveform samples within each block are modeled using the normalizing flow, enabling parallel training and synthesis. Longer-term dependencies are handled autoregressively by conditioning each flow on preceding blocks. This model can be optimized directly with maximum likelihood, without using intermediate, hand-designed features or additional loss terms. Contemporary state-of-the-art text-to-speech (TTS) systems use a cascade of separately learned models: one (such as Tacotron) which generates intermediate features (such as spectrograms) from text, followed by a vocoder (such as WaveRNN) which generates waveform sa
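The block-wise factorization described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: a single affine coupling layer stands in for the full normalizing flow, the conditioning context is simply the previous block rather than a text-conditioned decoder state, and every name here (`split_into_blocks`, `init_weights`, `coupling_forward`, and so on) is hypothetical.

```python
import numpy as np

def split_into_blocks(waveform, block_size):
    """Split a 1-D waveform into non-overlapping fixed-length blocks (zero-padded tail)."""
    pad = (-len(waveform)) % block_size
    return np.pad(waveform, (0, pad)).reshape(-1, block_size)

def init_weights(block_size, hidden, rng):
    """Random weights for a toy conditioner network (an assumption, not from the paper)."""
    half = block_size // 2
    rest = block_size - half
    return {
        "w_a": 0.1 * rng.standard_normal((hidden, half)),
        "w_c": 0.1 * rng.standard_normal((hidden, block_size)),
        "w_s": 0.1 * rng.standard_normal((rest, hidden)),
        "w_t": 0.1 * rng.standard_normal((rest, hidden)),
    }

def _scale_and_shift(a, context, weights):
    """Predict log-scale and shift for the second half from the first half and the context."""
    h = np.tanh(weights["w_a"] @ a + weights["w_c"] @ context)
    return weights["w_s"] @ h, weights["w_t"] @ h

def coupling_forward(block, context, weights):
    """Map a waveform block to noise z; return z and log|det dz/dblock|."""
    half = block.shape[0] // 2
    a, b = block[:half], block[half:]
    log_scale, shift = _scale_and_shift(a, context, weights)
    z_b = (b - shift) * np.exp(-log_scale)
    return np.concatenate([a, z_b]), -np.sum(log_scale)

def coupling_inverse(z, context, weights):
    """Invert coupling_forward: map noise z back to a waveform block (used for sampling)."""
    half = z.shape[0] // 2
    a, z_b = z[:half], z[half:]
    log_scale, shift = _scale_and_shift(a, context, weights)
    return np.concatenate([a, z_b * np.exp(log_scale) + shift])

def block_log_likelihood(block, context, weights):
    """Change of variables: log p(block) = log N(z; 0, I) + log|det dz/dblock|."""
    z, log_det = coupling_forward(block, context, weights)
    log_prior = -0.5 * np.sum(z ** 2 + np.log(2.0 * np.pi))
    return log_prior + log_det

def waveform_log_likelihood(waveform, block_size, weights):
    """Sum block log-likelihoods, conditioning each block on the one before it."""
    context = np.zeros(block_size)  # assumed all-zero context for the first block
    total = 0.0
    for block in split_into_blocks(waveform, block_size):
        total += block_log_likelihood(block, context, weights)
        context = block  # simplification: only the immediately preceding block
    return total

rng = np.random.default_rng(0)
weights = init_weights(block_size=8, hidden=16, rng=rng)
log_lik = waveform_log_likelihood(rng.standard_normal(30), 8, weights)
```

Within a block, all samples are transformed in one pass, which is what allows parallel computation inside each step; only the loop over blocks is sequential. Maximizing `log_lik` with respect to the weights corresponds to the maximum-likelihood objective the abstract mentions, with no spectrogram or auxiliary loss involved.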
Related papers
- Tacotron: Towards End-to-end Speech Synthesis (2017) 0.00
- Natural TTS Synthesis By Conditioning WaveNet On Mel Spectrogram Predictions (2017) 24.07
- Neural Speech Synthesis With Transformer Network (2018) 19.95
- Parallel Tacotron: Non-autoregressive And Controllable TTS (2020) 12.54
- Flowtron: An Autoregressive Flow-based Generative Network For Text-to-speech Synthesis (2020) 5.91
- FastSpeech: Fast, Robust And Controllable Text To Speech (2019) 0.00
- WaveNet: A Generative Model For Raw Audio (2016) 0.00
- Waveform Generation For Text-to-speech Synthesis Using Pitch-synchronous Multi-scale Generative Adversarial Networks (2018) 8.35