Waveform Generation For Text-to-speech Synthesis Using Pitch-synchronous Multi-scale Generative Adversarial Networks
2018 Β· Lauri Juvela, Bajibabu Bollepalli, Junichi Yamagishi, et al.
Abstract
The state-of-the-art in text-to-speech synthesis has recently improved considerably due to novel neural waveform generation methods, such as WaveNet. However, these methods suffer from their slow sequential inference process, while their parallel versions are difficult to train and even more expensive computationally. Meanwhile, generative adversarial networks (GANs) have achieved impressive results in image generation and are making their way into audio applications; parallel inference is among their lucrative properties. By adopting recent advances in GAN training techniques, this investigation studies waveform generation for TTS in two domains (speech signal and glottal excitation). Listening test results show that while direct waveform generation with GAN is still far behind WaveNet, a GAN-based glottal excitation model can achieve quality and voice similarity on par with a WaveNet vocoder.
Authors
(none)
Tags
Stats
Related papers
- Generative Adversarial Network-based Glottal Waveform Model For Statistical Parametric Speech Synthesis (2019)10.35
- High Fidelity Speech Synthesis With Adversarial Networks (2019)0.00
- Expediting TTS Synthesis With Adversarial Vocoding (2019)6.77
- TFGAN: Time And Frequency Domain Based Generative Adversarial Network For High-fidelity Speech Synthesis (2020)0.00
- Parallel Wavegan: A Fast Waveform Generation Model Based On Generative Adversarial Networks With Multi-resolution Spectrogram (2019)0.00
- Voice Command Generation Using Progressive Wavegans (2019)0.00
- Parallel Waveform Synthesis Based On Generative Adversarial Networks With Voicing-aware Conditional Discriminators (2020)7.16
- End-to-end Video-to-speech Synthesis Using Generative Adversarial Networks (2021)11.58