An Investigation Of Time-frequency Representation Discriminators For High-fidelity Vocoder
2024 · Yicheng Gu, Xueyao Zhang, Liumeng Xue, et al.
Abstract
Generative Adversarial Network (GAN) based vocoders are superior in both inference speed and synthesis quality when reconstructing an audible waveform from an acoustic representation. This study focuses on improving the discriminator for GAN-based vocoders. Most existing Time-Frequency Representation (TFR)-based discriminators are rooted in the Short-Time Fourier Transform (STFT), which has a constant Time-Frequency (TF) resolution, linearly scaled center frequencies, and a fixed decomposition basis. This makes it a poor fit for signals such as singing voices, which require dynamic attention across different frequency bands and time intervals. Motivated by this, we propose a Multi-Scale Sub-Band Constant-Q Transform (MS-SB-CQT) discriminator and a Multi-Scale Temporal-Compressed Continuous Wavelet Transform (MS-TC-CWT) discriminator. Both the CQT and the CWT have a dynamic TF resolution for different frequency bands. In contrast, the CQT is better at modeling pitch information, an
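The contrast the abstract draws between the STFT's linearly spaced center frequencies and the CQT's frequency-dependent resolution can be illustrated with a small sketch. It uses only the textbook definitions of the two transforms; the sample rate, FFT size, minimum frequency, and bins-per-octave values are illustrative assumptions, not settings taken from the paper, and the function names are hypothetical.

```python
import math

def stft_center_frequencies(sample_rate, n_fft):
    """Bin centers of an n_fft-point STFT: evenly spaced from 0 Hz to Nyquist."""
    return [k * sample_rate / n_fft for k in range(n_fft // 2 + 1)]

def cqt_center_frequencies(f_min, n_bins, bins_per_octave):
    """Bin centers of a CQT: geometrically spaced, f_k = f_min * 2^(k / B)."""
    return [f_min * 2.0 ** (k / bins_per_octave) for k in range(n_bins)]

def cqt_q_factor(bins_per_octave):
    """Q = f_k / bandwidth_k, the same for every CQT bin."""
    return 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)

# Illustrative values (assumptions, not the paper's configuration).
SAMPLE_RATE = 24000
N_FFT = 1024
F_MIN = 32.7           # roughly the note C1, in Hz
N_BINS = 84            # 7 octaves
BINS_PER_OCTAVE = 12   # one bin per semitone

stft_freqs = stft_center_frequencies(SAMPLE_RATE, N_FFT)
cqt_freqs = cqt_center_frequencies(F_MIN, N_BINS, BINS_PER_OCTAVE)
q = cqt_q_factor(BINS_PER_OCTAVE)

# STFT: every bin is the same width in Hz, regardless of frequency.
stft_spacing = stft_freqs[1] - stft_freqs[0]

# CQT: bandwidth grows with center frequency, so low bins are narrow
# (fine frequency resolution) and high bins are wide (fine time resolution).
cqt_bandwidths = [f / q for f in cqt_freqs]

print(f"STFT bin spacing: {stft_spacing:.4f} Hz at every frequency")
print(f"CQT bandwidth at {cqt_freqs[0]:.1f} Hz: {cqt_bandwidths[0]:.2f} Hz")
print(f"CQT bandwidth at {cqt_freqs[-1]:.1f} Hz: {cqt_bandwidths[-1]:.2f} Hz")
```

With one bin per semitone, each CQT bin lines up with a musical pitch, which is one common way to motivate the transform for pitch-sensitive signals such as singing voice.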
Related papers
- Multi-Scale Sub-Band Constant-Q Transform Discriminator For High-Fidelity Vocoder (2023) [0.00]
- TFGAN: Time And Frequency Domain Based Generative Adversarial Network For High-Fidelity Speech Synthesis (2020) [0.00]
- A Multi-Scale Time-Frequency Spectrogram Discriminator For GAN-Based Non-Autoregressive TTS (2022) [6.77]
- VNet: A GAN-Based Multi-Tier Discriminator Network For Speech Synthesis Vocoders (2024) [2.26]
- VocGAN: A High-Fidelity Real-Time Vocoder With A Hierarchically-Nested Adversarial Network (2020) [12.54]
- Vocos: Closing The Gap Between Time-Domain And Fourier-Based Neural Vocoders For High-Quality Audio Synthesis (2023) [6.10]
- Avocodo: Generative Adversarial Network For Artifact-Free Vocoder (2022) [9.41]
- BemaGANv2: Discriminator Combination Strategies For GAN-Based Vocoders In Long-Term Audio Generation (2025) [2.68]