Chunked Autoregressive GAN For Conditional Waveform Synthesis
2021 · Max Morrison, Rithesh Kumar, Kundan Kumar, et al.
Abstract
Conditional waveform synthesis models learn a distribution of audio waveforms given conditioning such as text, mel-spectrograms, or MIDI. These systems employ deep generative models that model the waveform via either sequential (autoregressive) or parallel (non-autoregressive) sampling. Generative adversarial networks (GANs) have become a common choice for non-autoregressive waveform synthesis. However, state-of-the-art GAN-based models produce artifacts when performing mel-spectrogram inversion. In this paper, we demonstrate that these artifacts correspond to the generator's inability to learn accurate pitch and periodicity. We show that simple pitch and periodicity conditioning is insufficient for reducing this error relative to using autoregression. We discuss the inductive bias that autoregression provides for learning the relationship between instantaneous frequency and phase, and show that this inductive bias holds even when autoregressively sampling large chunks of the wa
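The relationship between instantaneous frequency and phase mentioned in the abstract can be illustrated with a toy example. The sketch below is not the paper's model. It is a plain sinusoidal oscillator, and the function names, sample rate, pitch, and chunk size are all hypothetical choices made for illustration. It assumes only the standard identity that phase is the running sum of instantaneous frequency, and it compares chunked synthesis that carries the previous chunk's final phase forward (a stand-in for conditioning on earlier output) with chunked synthesis that restarts the phase in every chunk.

```python
import math


def synth_chunk(freqs_hz, sample_rate, start_phase=0.0):
    """Render one chunk of a sinusoid from per-sample instantaneous frequency.

    Returns (samples, end_phase) so a caller can carry the phase forward.
    """
    phase = start_phase
    samples = []
    for f in freqs_hz:
        samples.append(math.sin(phase))
        # Phase is the running sum (discrete integral) of instantaneous frequency.
        phase += 2.0 * math.pi * f / sample_rate
    return samples, phase


def synth_chunked(freqs_hz, sample_rate, chunk_size, carry_phase=True):
    """Render the signal chunk by chunk.

    carry_phase=True: each chunk starts from the previous chunk's end phase
    (toy analogue of chunked autoregression).
    carry_phase=False: each chunk restarts at phase 0 (chunks are independent).
    """
    out = []
    phase = 0.0
    for start in range(0, len(freqs_hz), chunk_size):
        chunk, end_phase = synth_chunk(
            freqs_hz[start:start + chunk_size], sample_rate, phase
        )
        out.extend(chunk)
        phase = end_phase if carry_phase else 0.0
    return out


def max_jump(samples):
    """Largest absolute difference between consecutive samples."""
    return max(abs(b - a) for a, b in zip(samples, samples[1:]))


# Hypothetical settings chosen only for this illustration.
SAMPLE_RATE = 16000
freqs = [220.0] * 4096  # constant pitch of 220 Hz

full, _ = synth_chunk(freqs, SAMPLE_RATE)
carried = synth_chunked(freqs, SAMPLE_RATE, chunk_size=1000, carry_phase=True)
reset = synth_chunked(freqs, SAMPLE_RATE, chunk_size=1000, carry_phase=False)

print("max jump, single pass:   ", max_jump(full))
print("max jump, phase carried: ", max_jump(carried))
print("max jump, phase reset:   ", max_jump(reset))
```

In this toy setting, carrying the phase across chunk boundaries reproduces the single-pass waveform, while resetting it introduces a discontinuity at each boundary. That is only an analogy for why access to previously generated samples can help a generator keep phase consistent with pitch; it says nothing about how the paper's generator is actually built.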
Related papers
- MelGAN: Generative Adversarial Networks For Conditional Waveform Synthesis (2019)
- Conditional WaveGAN (2018)
- Adversarial Audio Synthesis (2018)
- GANSynth: Adversarial Neural Audio Synthesis (2019)
- Waveform Generation For Text-to-speech Synthesis Using Pitch-synchronous Multi-scale Generative Adversarial Networks (2018)
- Generative Adversarial Network-based Glottal Waveform Model For Statistical Parametric Speech Synthesis (2019)
- Unconditional Audio Generation With Generative Adversarial Networks And Cycle Regularization (2020)
- Parallel Waveform Synthesis Based On Generative Adversarial Networks With Voicing-aware Conditional Discriminators (2020)