iSTFTNet: Fast and Lightweight Mel-Spectrogram Vocoder Incorporating Inverse Short-Time Fourier Transform
2022 · Takuhiro Kaneko, Kou Tanaka, Hirokazu Kameoka, et al.
Abstract
In recent text-to-speech synthesis and voice conversion systems, a mel-spectrogram is commonly applied as an intermediate representation, and the necessity for a mel-spectrogram vocoder is increasing. A mel-spectrogram vocoder must solve three inverse problems: recovery of the original-scale magnitude spectrogram, phase reconstruction, and frequency-to-time conversion. A typical convolutional mel-spectrogram vocoder solves these problems jointly and implicitly using a convolutional neural network, including temporal upsampling layers, when directly calculating a raw waveform. Such an approach allows skipping redundant processes during waveform synthesis (e.g., the direct reconstruction of high-dimensional original-scale spectrograms). By contrast, the approach solves all problems in a black box and cannot effectively employ the time-frequency structures existing in a mel-spectrogram. We thus propose iSTFTNet, which replaces some output-side layers of the mel-spectrogram vocoder with th
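The last of the three inverse problems named in the abstract, frequency-to-time conversion, can be illustrated with a plain inverse short-time Fourier transform. The sketch below is not the authors' network: it only shows how a waveform is rebuilt from per-frame magnitude and phase by an inverse DFT and windowed overlap-add. The magnitude and phase come from analysing a test tone rather than from a neural network, and the function names, `n_fft`, and `hop` are illustrative assumptions, as is the choice of a periodic Hann window.

```python
import cmath
import math


def hann(n_fft):
    # Periodic Hann window (an assumed choice for this sketch).
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / n_fft) for n in range(n_fft)]


def stft(signal, n_fft, hop):
    """Return (magnitudes, phases): one list per frame, n_fft // 2 + 1 bins each."""
    window = hann(n_fft)
    mags, phases = [], []
    for start in range(0, len(signal) - n_fft + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(n_fft)]
        # Naive DFT of the windowed frame, keeping only non-negative frequencies.
        bins = [
            sum(frame[n] * cmath.exp(-2j * math.pi * k * n / n_fft) for n in range(n_fft))
            for k in range(n_fft // 2 + 1)
        ]
        mags.append([abs(b) for b in bins])
        phases.append([cmath.phase(b) for b in bins])
    return mags, phases


def istft(mags, phases, n_fft, hop):
    """Rebuild a waveform from magnitude and phase frames (n_fft must be even)."""
    window = hann(n_fft)
    length = hop * (len(mags) - 1) + n_fft
    out = [0.0] * length
    norm = [0.0] * length
    for t, (mag, phase) in enumerate(zip(mags, phases)):
        half = [cmath.rect(m, p) for m, p in zip(mag, phase)]
        # Restore the negative-frequency bins via Hermitian symmetry (real signal).
        full = half + [half[n_fft - k].conjugate() for k in range(n_fft // 2 + 1, n_fft)]
        # Naive inverse DFT gives the windowed time-domain frame.
        frame = [
            (sum(full[k] * cmath.exp(2j * math.pi * k * n / n_fft) for k in range(n_fft)) / n_fft).real
            for n in range(n_fft)
        ]
        # Windowed overlap-add, tracking the summed squared window for normalisation.
        for n in range(n_fft):
            out[t * hop + n] += frame[n] * window[n]
            norm[t * hop + n] += window[n] ** 2
    return [o / w if w > 1e-8 else 0.0 for o, w in zip(out, norm)]


# Round trip on a short test tone: analyse, then resynthesise.
signal = [math.sin(2 * math.pi * 3 * n / 64) for n in range(64)]
mags, phases = stft(signal, n_fft=16, hop=4)
recon = istft(mags, phases, n_fft=16, hop=4)
```

With both magnitude and phase available, the inverse transform is a fixed signal-processing step with no learned parameters, so in this sketch the hard parts of the problem are the first two: recovering the original-scale magnitude and reconstructing the phase.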
Related papers
- HiFTNet: A Fast High-Quality Neural Vocoder with Harmonic-plus-Noise Filter and Inverse Short Time Fourier Transform (2023)
- FastFit: Towards Real-Time Iterative Neural Vocoder by Replacing U-Net Encoder with Multiple STFTs (2023)
- ESTVocoder: An Excitation-Spectral-Transformed Neural Vocoder Conditioned on Mel Spectrogram (2024)
- FastSpeech: Fast, Robust and Controllable Text to Speech (2019)
- UnivNet: A Neural Vocoder with Multi-Resolution Spectrogram Discriminators for High-Fidelity Waveform Generation (2021)
- A Neural Denoising Vocoder for Clean Waveform Generation from Noisy Mel-Spectrogram Based on Amplitude and Phase Predictions (2024)
- QuickVC: Any-to-Many Voice Conversion Using Inverse Short-Time Fourier Transform for Faster Conversion (2023)
- BiVocoder: A Bidirectional Neural Vocoder Integrating Feature Extraction and Waveform Generation (2024)