DSPGAN: A GAN-based Universal Vocoder For High-fidelity TTS By Time-frequency Domain Supervision From DSP
2022 · Kun Song, Yongmao Zhang, Yi Lei, et al.
Abstract
Neural vocoders based on generative adversarial networks (GANs) have recently shown clear advantages in generating raw waveforms conditioned on mel-spectrograms, offering fast inference and lightweight networks. However, it remains challenging to train a universal neural vocoder that can synthesize high-fidelity speech across varied scenarios, including unseen speakers, languages, and speaking styles. In this paper, we propose DSPGAN, a GAN-based universal vocoder for high-fidelity speech synthesis that applies time-frequency domain supervision from digital signal processing (DSP). To eliminate the mismatch between the ground-truth spectrograms used in training and the predicted spectrograms used at inference, we use the mel-spectrogram extracted from the waveform generated by a DSP module, rather than the mel-spectrogram predicted by the Text-to-Speech (TTS) acoustic model, as the time-frequency domain supervision for the GAN-based vocoder.
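The core idea in the abstract, taking the mel-spectrogram from a DSP-generated waveform rather than directly from the acoustic model, can be illustrated with a small NumPy sketch. Everything named here is an assumption made for illustration: `toy_dsp_synth` is a hypothetical sine-harmonic synthesizer standing in for the paper's DSP module (whose design the abstract does not specify), and `log_mel`, the 16 kHz sample rate, the 1024-point FFT, the hop of 256, and the 80 mel bands are arbitrary choices, not values from the paper.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    # Triangular filters whose edges are equally spaced on the mel scale.
    edges_hz = mel_to_hz(np.linspace(0.0, hz_to_mel(sr / 2), n_mels + 2))
    bin_hz = np.linspace(0.0, sr / 2, n_fft // 2 + 1)
    fb = np.zeros((n_mels, bin_hz.size))
    for i in range(n_mels):
        lo, mid, hi = edges_hz[i], edges_hz[i + 1], edges_hz[i + 2]
        rising = (bin_hz - lo) / (mid - lo)
        falling = (hi - bin_hz) / (hi - mid)
        fb[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

def log_mel(wave, sr=16000, n_fft=1024, hop=256, n_mels=80):
    # Frame, window, magnitude STFT, mel projection, log compression.
    n_frames = 1 + (len(wave) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = wave[idx] * np.hanning(n_fft)
    mag = np.abs(np.fft.rfft(frames, axis=1))
    mel = mag @ mel_filterbank(sr, n_fft, n_mels).T
    return np.log(np.clip(mel, 1e-5, None))

def toy_dsp_synth(f0, sr=16000, hop=256, n_harmonics=8):
    # Hypothetical stand-in for a DSP vocoder: sum of sine harmonics
    # driven by a frame-level F0 contour (no noise branch, no filter).
    f0_up = np.repeat(np.asarray(f0, dtype=float), hop)
    phase = 2.0 * np.pi * np.cumsum(f0_up) / sr
    wave = np.zeros_like(f0_up)
    for k in range(1, n_harmonics + 1):
        # Drop any harmonic that would land above the Nyquist frequency.
        wave += np.where(k * f0_up < sr / 2, np.sin(k * phase), 0.0) / k
    return wave / n_harmonics

# Assumed input: 100 frames of a flat 200 Hz pitch contour.
f0 = np.full(100, 200.0)
dsp_wave = toy_dsp_synth(f0)   # waveform produced by the DSP stand-in
dsp_mel = log_mel(dsp_wave)    # features that would condition the GAN vocoder
```

Because `dsp_mel` is computed from an actual waveform in the same way during training and inference, the vocoder would see the same kind of input in both phases, which is the mismatch the abstract aims to remove.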
Related papers
- TFGAN: Time And Frequency Domain Based Generative Adversarial Network For High-fidelity Speech Synthesis (2020)
- VocGAN: A High-fidelity Real-time Vocoder With A Hierarchically-nested Adversarial Network (2020)
- BigVGAN: A Universal Neural Vocoder With Large-scale Training (2022)
- Ultra-lightweight Neural Differential DSP Vocoder For High Quality Speech Synthesis (2024)
- SnakeGAN: A Universal Vocoder Leveraging DDSP Prior Knowledge And Periodic Inductive Bias (2023)
- VNet: A GAN-based Multi-tier Discriminator Network For Speech Synthesis Vocoders (2024)
- A Multi-scale Time-frequency Spectrogram Discriminator For GAN-based Non-autoregressive TTS (2022)
- Expediting TTS Synthesis With Adversarial Vocoding (2019)