HiFi-WaveGAN: Generative Adversarial Network With Auxiliary Spectrogram-phase Loss For High-fidelity Singing Voice Generation
2022 · Chunhui Wang, Chang Zeng, Jun Chen, et al.
Abstract
Entertainment-oriented singing voice synthesis (SVS) requires a vocoder that generates high-fidelity (e.g., 48 kHz) audio. However, most text-to-speech (TTS) vocoders cannot reconstruct the waveform well in this scenario. In this paper, we propose HiFi-WaveGAN to synthesize high-quality 48 kHz singing voices in real time. Specifically, it consists of an Extended WaveNet serving as the generator, the multi-period discriminator proposed in HiFi-GAN, and the multi-resolution spectrogram discriminator borrowed from UnivNet. To better reconstruct the high-frequency part from the full-band mel-spectrogram, we incorporate a pulse extractor that generates a constraint for the synthesized waveform. Additionally, an auxiliary spectrogram-phase loss is used to further approximate the real distribution. The experimental results show that the proposed HiFi-WaveGAN obtains a mean opinion score (MOS) of 4.23 on the 48 kHz SVS task, significantly outperforming other neural vocoders.
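The abstract names an auxiliary spectrogram-phase loss but does not define it. As a rough illustration only, the sketch below shows one plausible form of such a loss: an L1 distance between log-magnitude spectrograms plus a cosine penalty on the phase difference. The function names (`stft`, `spectrogram_phase_loss`), the equal weighting of the two terms, and the STFT settings (`n_fft=512`, `hop=128`) are all assumptions made for this example, not the authors' formulation.

```python
import numpy as np


def stft(x, n_fft=512, hop=128):
    """Hann-windowed STFT; returns a complex array of shape (frames, bins)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=-1)


def spectrogram_phase_loss(real, fake, n_fft=512, hop=128, eps=1e-7):
    """Hypothetical spectrogram-phase loss (an assumption, not the paper's).

    Magnitude term: mean absolute difference of log-magnitude spectrograms.
    Phase term: mean of 1 - cos(phase difference), which is 0 when the
    phases agree and is insensitive to 2*pi wrapping.
    """
    s_real = stft(real, n_fft, hop)
    s_fake = stft(fake, n_fft, hop)

    log_mag_real = np.log(np.abs(s_real) + eps)
    log_mag_fake = np.log(np.abs(s_fake) + eps)
    mag_term = np.mean(np.abs(log_mag_real - log_mag_fake))

    phase_diff = np.angle(s_real) - np.angle(s_fake)
    phase_term = np.mean(1.0 - np.cos(phase_diff))

    return mag_term + phase_term


# Usage: a 440 Hz tone at 48 kHz compared with a phase-shifted copy.
sample_rate = 48_000
t = np.arange(4800) / sample_rate
reference = np.sin(2 * np.pi * 440 * t)
shifted = np.sin(2 * np.pi * 440 * t + np.pi / 2)
loss_same = spectrogram_phase_loss(reference, reference)
loss_shifted = spectrogram_phase_loss(reference, shifted)
```

A magnitude-only spectrogram loss would rate the phase-shifted copy as almost identical to the reference; the phase term is what separates them here. In training, a loss of this kind would typically be computed with a differentiable framework and at several STFT resolutions, which this NumPy sketch leaves out.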
Related papers
- HiFi-GAN: Generative Adversarial Networks For Efficient And High Fidelity Speech Synthesis (2020) 0.00
- SingGAN: Generative Adversarial Network For High-fidelity Singing Voice Generation (2021) 10.61
- Improving Adversarial Waveform Generation Based Singing Voice Conversion With Harmonic Signals (2022) 7.50
- XiaoiceSing 2: A High-fidelity Singing Voice Synthesizer Based On Generative Adversarial Network (2022) 0.00
- InstructSing: High-fidelity Singing Voice Generation Via Instructing Yourself (2024) 0.00
- Mandarin Singing Voice Synthesis With Denoising Diffusion Probabilistic Wasserstein GAN (2022) 6.34
- Source-filter-based Generative Adversarial Neural Vocoder For High Fidelity Speech Synthesis (2023) 3.58
- HiFi++: A Unified Framework For Bandwidth Extension And Speech Enhancement (2022) 11.93