HiFTNet: A Fast High-Quality Neural Vocoder with Harmonic-plus-Noise Filter and Inverse Short Time Fourier Transform
2023 · Yinghao Aaron Li, Cong Han, Xilin Jiang, et al.
Abstract
Recent advancements in speech synthesis have leveraged GAN-based networks like HiFi-GAN and BigVGAN to produce high-fidelity waveforms from mel-spectrograms. However, these networks are computationally expensive and parameter-heavy. iSTFTNet addresses these limitations by integrating inverse short-time Fourier transform (iSTFT) into the network, achieving both speed and parameter efficiency. In this paper, we introduce an extension to iSTFTNet, termed HiFTNet, which incorporates a harmonic-plus-noise source filter in the time-frequency domain that uses a sinusoidal source from the fundamental frequency (F0) inferred via a pre-trained F0 estimation network for fast inference speed. Subjective evaluations on LJSpeech show that our model significantly outperforms both iSTFTNet and HiFi-GAN, achieving ground-truth-level performance. HiFTNet also outperforms BigVGAN-base on LibriTTS for unseen speakers and achieves comparable performance to BigVGAN while being four times faster with only \(
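The harmonic-plus-noise source described in the abstract can be illustrated with a rough NumPy sketch: a sum of sinusoids at multiples of F0 for voiced frames, plus Gaussian noise, followed by an STFT so the source lives in the time-frequency domain. This is only an illustration under stated assumptions, not the authors' implementation. The function names, sample rate, hop size, number of harmonics, amplitudes, and noise levels are all hypothetical choices, and the F0 contour is hand-written here rather than inferred by a pre-trained estimation network.

```python
import numpy as np

def harmonic_plus_noise_source(f0, sr=22050, hop=256, n_harmonics=8,
                               harmonic_amp=0.1, noise_std=0.003, seed=0):
    """Hypothetical helper: build an excitation from a frame-level F0 contour.

    f0 holds one value in Hz per frame; 0 marks an unvoiced frame.
    All default values are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    # Upsample frame-level F0 to sample level by simple repetition.
    f0_up = np.repeat(np.asarray(f0, dtype=np.float64), hop)
    voiced = f0_up > 0
    # Phase of the fundamental: running sum of the per-sample frequency.
    phase = 2.0 * np.pi * np.cumsum(f0_up / sr)
    source = np.zeros_like(f0_up)
    for k in range(1, n_harmonics + 1):
        # Keep harmonic k only where it is voiced and below Nyquist.
        keep = voiced & (k * f0_up < sr / 2)
        source += keep * np.sin(k * phase)
    source *= harmonic_amp
    # Noise part: low level under voiced speech, higher level when unvoiced.
    noise_scale = np.where(voiced, noise_std, harmonic_amp / 3.0)
    return source + noise_scale * rng.standard_normal(f0_up.shape)

def stft_frames(x, n_fft=1024, hop=256):
    """Hypothetical helper: Hann-windowed STFT without padding."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=-1)

# Toy contour: 10 unvoiced frames followed by 40 voiced frames at 220 Hz.
f0 = [0.0] * 10 + [220.0] * 40
source = harmonic_plus_noise_source(f0)
spec = stft_frames(source)
magnitude, phase = np.abs(spec), np.angle(spec)
```

In a vocoder of this kind, the magnitude and phase of the source would be fed to the network alongside the mel-spectrogram features; how HiFTNet combines them is not specified in the abstract, so that step is left out here.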
Related papers
- HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis (2020) 0.00
- HiFi-WaveGAN: Generative Adversarial Network with Auxiliary Spectrogram-Phase Loss for High-Fidelity Singing Voice Generation (2022) 0.00
- HiFi++: A Unified Framework for Bandwidth Extension and Speech Enhancement (2022) 11.93
- iSTFTNet: Fast and Lightweight Mel-Spectrogram Vocoder Incorporating Inverse Short-Time Fourier Transform (2022) 13.34
- Source-Filter HiFi-GAN: Fast and Pitch Controllable High-Fidelity Neural Vocoder (2022) 10.74
- Fast and Small Footprint Hybrid HMM-HiFiGAN Based System for Speech Synthesis in Indian Languages (2023) 0.00
- FastFit: Towards Real-Time Iterative Neural Vocoder by Replacing U-Net Encoder with Multiple STFTs (2023) 0.00
- TFGAN: Time and Frequency Domain Based Generative Adversarial Network for High-Fidelity Speech Synthesis (2020) 0.00