Universal MelGAN: A Robust Neural Vocoder For High-fidelity Waveform Generation In Multiple Domains
2020 · Won Jang, Dan Lim, Jaesam Yoon
Abstract
We propose Universal MelGAN, a vocoder that synthesizes high-fidelity speech in multiple domains. To preserve sound quality when the MelGAN-based structure is trained with a dataset of hundreds of speakers, we added multi-resolution spectrogram discriminators to sharpen the spectral resolution of the generated waveforms. This enables the model to generate realistic waveforms for multiple speakers by alleviating the over-smoothing problem in the high-frequency band of the large-footprint model. Because the waveform and spectrogram are discriminated only during training, our structure generates signals close to ground-truth data without reducing the inference speed. The model achieved the best mean opinion score (MOS) in most scenarios using ground-truth mel-spectrograms as input. In particular, it showed superior performance in unseen domains with regard to speaker, emotion, and language. Moreover, in a multi-speaker text-to-speech scenario using mel-spectrograms generated by a transformer model, it synt
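As a rough illustration of what "multi-resolution spectrogram" means in the abstract, the sketch below computes magnitude spectrograms of one waveform at several STFT settings, which is the kind of input a spectrogram discriminator would see. It is a minimal NumPy sketch, not the authors' implementation: the function names and the `(n_fft, hop)` pairs in `RESOLUTIONS` are hypothetical values chosen for illustration and are not taken from the paper.

```python
import numpy as np

# Hypothetical (n_fft, hop) pairs; illustrative only, not from the paper.
RESOLUTIONS = [(512, 128), (1024, 256), (2048, 512)]


def stft_magnitude(wave, n_fft, hop):
    """Return a (frames, bins) magnitude spectrogram from a Hann-windowed STFT."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack(
        [wave[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1))


def multi_resolution_spectrograms(wave):
    """Compute one magnitude spectrogram per (n_fft, hop) setting."""
    return [stft_magnitude(wave, n_fft, hop) for n_fft, hop in RESOLUTIONS]


# Toy input: one second of a 440 Hz sine at 16 kHz.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
wave = np.sin(2 * np.pi * 440.0 * t)

specs = multi_resolution_spectrograms(wave)
for (n_fft, hop), spec in zip(RESOLUTIONS, specs):
    print(f"n_fft={n_fft:4d} hop={hop:3d} -> frames x bins = {spec.shape}")
```

Small windows give fine time resolution and coarse frequency resolution, and large windows the reverse, so comparing real and generated audio at several settings exposes spectral detail that a single setting would blur. In a training setup like the one the abstract describes, each spectrogram would feed its own discriminator; none of this runs at inference time.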
Related papers
- VocGAN: A High-fidelity Real-time Vocoder With A Hierarchically-nested Adversarial Network (2020) 12.54
- StyleMelGAN: An Efficient High-fidelity Adversarial Vocoder With Temporal Adaptive Normalization (2020) 13.05
- UnivNet: A Neural Vocoder With Multi-resolution Spectrogram Discriminators For High-fidelity Waveform Generation (2021) 14.80
- MelGAN-VC: Voice Conversion And Audio Style Transfer On Arbitrarily Long Samples Using Spectrograms (2019) 0.00
- BigVGAN: A Universal Neural Vocoder With Large-scale Training (2022) 6.17
- VNet: A GAN-based Multi-tier Discriminator Network For Speech Synthesis Vocoders (2024) 2.26
- Relational Data Selection For Data Augmentation Of Speaker-dependent Multi-band MelGAN Vocoder (2021) 0.00
- DSPGAN: A GAN-based Universal Vocoder For High-fidelity TTS By Time-frequency Domain Supervision From DSP (2022) 9.03