ESTVocoder: An Excitation-spectral-transformed Neural Vocoder Conditioned On Mel Spectrogram
2024 · Xiao-Hang Jiang, Hui-Peng Du, Yang Ai, et al.
Abstract
This paper proposes ESTVocoder, a novel excitation-spectral-transformed neural vocoder within the framework of source-filter theory. ESTVocoder transforms the amplitude and phase spectra of the excitation into the corresponding speech amplitude and phase spectra using a neural filter built on ConvNeXt v2 blocks. The speech waveform is then reconstructed through the inverse short-time Fourier transform (ISTFT). The excitation is constructed from the F0: for voiced segments it contains full harmonic information, while for unvoiced segments it is represented by noise. The excitation gives the filter prior knowledge of the amplitude and phase patterns, which is expected to reduce the modeling difficulty compared to conventional neural vocoders. To ensure the fidelity of the synthesized speech, an adversarial training strategy is applied to ESTVocoder with multi-scale and multi-resolution discriminators. Analysis-synthesis and text-to-speech experiments both confirm t
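The F0-based excitation described in the abstract can be illustrated with a minimal NumPy sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the function names `build_excitation` and `stft_amp_phase`, the sampling rate, hop size, FFT size, noise level, and the choice to sum unit-amplitude sine harmonics up to the Nyquist frequency. The neural filter and the ISTFT stage are not shown.

```python
import numpy as np

def build_excitation(f0, sr=16000, hop=80, noise_std=0.03, seed=0):
    """Hypothetical helper: frame-level F0 in Hz (0 = unvoiced) -> sample-level excitation."""
    f0 = np.asarray(f0, dtype=np.float64)
    f0_up = np.repeat(f0, hop)                   # hold each frame value for `hop` samples
    voiced = f0_up > 0
    phase = 2.0 * np.pi * np.cumsum(f0_up / sr)  # running phase of the fundamental
    excitation = np.zeros_like(f0_up)

    # Voiced samples: sum every harmonic that stays below the Nyquist frequency.
    max_h = int((sr / 2) // f0[f0 > 0].min()) if np.any(f0 > 0) else 0
    for k in range(1, max_h + 1):
        mask = voiced & (k * f0_up < sr / 2)
        excitation[mask] += np.sin(k * phase[mask])

    # Unvoiced samples: Gaussian noise (the noise level is an assumed value).
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, noise_std, size=f0_up.shape)
    excitation[~voiced] = noise[~voiced]
    return excitation

def stft_amp_phase(x, n_fft=320, hop=80):
    """Hypothetical helper: amplitude and phase spectra of x via a Hann-windowed STFT."""
    window = np.hanning(n_fft)
    x = np.pad(x, (n_fft // 2, n_fft // 2), mode="reflect")  # centre the frames
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    spec = np.fft.rfft(frames, axis=1)
    return np.abs(spec), np.angle(spec)

# Ten unvoiced frames followed by ten voiced frames at 200 Hz.
f0 = [0.0] * 10 + [200.0] * 10
exc = build_excitation(f0)
amp, phase = stft_amp_phase(exc)
```

In this sketch, `amp` and `phase` stand in for the excitation spectra that a neural filter would map to speech amplitude and phase spectra before the waveform is recovered by ISTFT.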
Related papers
- ExcitNet Vocoder: A Neural Excitation Model For Parametric Speech Synthesis Systems (2018) 9.76
- A Neural Denoising Vocoder For Clean Waveform Generation From Noisy Mel-spectrogram Based On Amplitude And Phase Predictions (2024) 0.00
- UnivNet: A Neural Vocoder With Multi-resolution Spectrogram Discriminators For High-fidelity Waveform Generation (2021) 14.80
- BiVocoder: A Bidirectional Neural Vocoder Integrating Feature Extraction And Waveform Generation (2024) 5.84
- iSTFTNet: Fast And Lightweight Mel-spectrogram Vocoder Incorporating Inverse Short-time Fourier Transform (2022) 13.34
- NeuralDPS: Neural Deterministic Plus Stochastic Model With Multiband Excitation For Noise-controllable Waveform Generation (2022) 0.00
- Mathematical Vocoder Algorithm: Modified Spectral Inversion For Efficient Neural Speech Synthesis (2021) 0.00
- LA-VocE: Low-SNR Audio-visual Speech Enhancement Using Neural Vocoders (2022) 0.00