FastFit: Towards Real-Time Iterative Neural Vocoder by Replacing U-Net Encoder with Multiple STFTs
2023 · Won Jang, Dan Lim, Heayoung Park
Abstract
This paper presents FastFit, a novel neural vocoder architecture that replaces the U-Net encoder with multiple short-time Fourier transforms (STFTs) to achieve faster generation without sacrificing sample quality. Each encoder block is replaced with an STFT whose parameters match the temporal resolution of the corresponding decoder block, and the STFT output feeds that block's skip connection. FastFit reduces both the number of parameters and the generation time of the model by almost half while maintaining high fidelity. Objective and subjective evaluations show that the proposed model generates nearly twice as fast as baseline iteration-based vocoders while maintaining high sound quality. In text-to-speech evaluation scenarios, including multi-speaker and zero-shot text-to-speech, FastFit produces sound quality similar to that of the other baselines.
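The idea of replacing learned encoder blocks with STFTs can be illustrated with a minimal NumPy sketch. It computes one magnitude STFT per decoder resolution, so each result has the frame rate that a skip connection at that level would need. The hop sizes `(8, 64, 256)`, the `win_factor` of 4, the Hann window, the use of magnitudes only, and the function names are all assumptions made for illustration; the abstract does not state the actual configuration.

```python
import numpy as np

def stft_magnitude(wave, n_fft, hop):
    """Magnitude STFT with a Hann window; returns (frames, n_fft // 2 + 1)."""
    # Reflect-pad so that frame t is centred on sample t * hop.
    padded = np.pad(wave, n_fft // 2, mode="reflect")
    n_frames = 1 + (len(padded) - n_fft) // hop
    window = np.hanning(n_fft)
    frames = np.stack(
        [padded[t * hop : t * hop + n_fft] * window for t in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=-1))

def multi_stft_skips(wave, hops=(8, 64, 256), win_factor=4):
    """One STFT per hypothetical decoder block, keyed by that block's hop size.

    The hop sizes and window factor are illustrative assumptions, not values
    taken from the paper.
    """
    return {
        hop: stft_magnitude(wave, n_fft=win_factor * hop, hop=hop) for hop in hops
    }

rng = np.random.default_rng(0)
wave = rng.standard_normal(4096)  # stand-in for a noisy waveform input
skips = multi_stft_skips(wave)
for hop, feat in skips.items():
    print(hop, feat.shape)
```

Because the transforms are fixed rather than learned, this part of the network adds no trainable parameters, which is consistent with the reduction in model size reported in the abstract.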
Related papers
- HiFTNet: A Fast High-Quality Neural Vocoder with Harmonic-plus-Noise Filter and Inverse Short Time Fourier Transform (2023) 0.00
- iSTFTNet: Fast and Lightweight Mel-Spectrogram Vocoder Incorporating Inverse Short-Time Fourier Transform (2022) 13.34
- WaveFit: An Iterative and Non-Autoregressive Neural Vocoder Based on Fixed-Point Iteration (2022) 9.41
- BiVocoder: A Bidirectional Neural Vocoder Integrating Feature Extraction and Waveform Generation (2024) 5.84
- FastVC: Fast Voice Conversion with Non-Parallel Data (2020) 5.24
- FLY-TTS: Fast, Lightweight and High-Quality End-to-End Text-to-Speech Synthesis (2024) 0.00
- UnivNet: A Neural Vocoder with Multi-Resolution Spectrogram Discriminators for High-Fidelity Waveform Generation (2021) 14.80
- Vocos: Closing the Gap Between Time-Domain and Fourier-Based Neural Vocoders for High-Quality Audio Synthesis (2023) 6.10