End-to-end Adversarial Text-to-speech
2020 · Jeff Donahue, Sander Dieleman, Mikołaj Bińkowski, et al.
Abstract
Modern text-to-speech synthesis pipelines typically involve multiple processing stages, each of which is designed or learnt independently from the rest. In this work, we take on the challenging task of learning to synthesise speech from normalised text or phonemes in an end-to-end manner, resulting in models which operate directly on character or phoneme input sequences and produce raw speech audio outputs. Our proposed generator is feed-forward and thus efficient for both training and inference, using a differentiable alignment scheme based on token length prediction. It learns to produce high fidelity audio through a combination of adversarial feedback and prediction losses constraining the generated audio to roughly match the ground truth in terms of its total duration and mel-spectrogram. To allow the model to capture temporal variation in the generated audio, we employ soft dynamic time warping in the spectrogram-based prediction loss. The resulting model achieves a mean opinion s
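The abstract mentions soft dynamic time warping in the spectrogram-based prediction loss. As a rough illustration only, the sketch below implements the generic soft-DTW recursion, in which the hard minimum of classic DTW is replaced by a smooth minimum controlled by a temperature `gamma`. The function names, the L1 frame cost, and the default `gamma` are assumptions made for this example; the abstract does not specify the paper's exact formulation, so this should not be read as the authors' loss.

```python
import math


def soft_min(values, gamma):
    """Smooth minimum: -gamma * log(sum(exp(-v / gamma))), computed stably."""
    m = min(values)
    if math.isinf(m):
        # All candidates are unreachable cells.
        return m
    return m - gamma * math.log(sum(math.exp(-(v - m) / gamma) for v in values))


def soft_dtw(pred, target, gamma=0.01):
    """Generic soft-DTW cost between two sequences of feature frames.

    pred, target: lists of frames, each frame a list of floats
    (for example, mel-spectrogram bins). The L1 frame cost is an
    assumption for illustration.
    """
    n, m = len(pred), len(target)
    inf = float("inf")
    # r[i][j] holds the soft-minimal cost of aligning pred[:i] with target[:j].
    r = [[inf] * (m + 1) for _ in range(n + 1)]
    r[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = sum(abs(a - b) for a, b in zip(pred[i - 1], target[j - 1]))
            r[i][j] = cost + soft_min(
                [r[i - 1][j], r[i][j - 1], r[i - 1][j - 1]], gamma
            )
    return r[n][m]


# Two toy one-bin "spectrograms" with the same content but different timing.
pred = [[0.0], [0.0], [1.0], [2.0]]
target = [[0.0], [1.0], [2.0], [2.0]]
framewise_l1 = sum(abs(p[0] - t[0]) for p, t in zip(pred, target))
warped = soft_dtw(pred, target)
```

In this toy case the frame-by-frame L1 distance penalises the timing mismatch, while the soft-DTW cost stays near zero because the alignment is allowed to stretch. Because the smooth minimum is differentiable, a cost of this kind can be used as a training loss, which is the property the abstract relies on.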
Related papers
- Generative Adversarial Training For Text-to-speech Synthesis Based On Raw Phonetic Input And Explicit Prosody Modelling (2023)
- Conditional Variational Autoencoder With Adversarial Learning For End-to-end Text-to-speech (2021)
- End-to-end Video-to-speech Synthesis Using Generative Adversarial Networks (2021)
- High Fidelity Speech Synthesis With Adversarial Networks (2019)
- Reinforce-aligner: Reinforcement Alignment Search For Robust End-to-end Text-to-speech (2021)
- Adversarial Learning Of Intermediate Acoustic Feature For End-to-end Lightweight Text-to-speech (2022)
- Autotts: End-to-end Text-to-speech Synthesis Through Differentiable Duration Modeling (2022)
- Expediting TTS Synthesis With Adversarial Vocoding (2019)