Wavehax: Aliasing-free Neural Waveform Synthesis Based On 2D Convolution And Harmonic Prior For Reliable Complex Spectrogram Estimation
2024 · Reo Yoneyama, Atsushi Miyashita, Ryuichi Yamamoto, et al.
Abstract
Neural vocoders often struggle with aliasing in latent feature spaces, caused by time-domain nonlinear operations and resampling layers. Aliasing folds high-frequency components into the low-frequency range, making aliased and original frequency components indistinguishable and introducing two practical issues. First, aliasing complicates the waveform generation process, as the subsequent layers must address these aliasing effects, increasing the computational complexity. Second, it limits extrapolation performance, particularly in handling high fundamental frequencies, which degrades the perceptual quality of generated speech waveforms. This paper demonstrates that 1) time-domain nonlinear operations inevitably introduce aliasing but provide a strong inductive bias for harmonic generation, and 2) time-frequency-domain processing can achieve aliasing-free waveform synthesis but lacks the inductive bias for effective harmonic generation. Building on this insight, we propose Wavehax, an …
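The abstract's first claim, that a time-domain nonlinearity both creates harmonics and causes aliasing, can be checked numerically. The sketch below is not from the paper: it assumes a cubic nonlinearity as a stand-in for the activation functions inside a vocoder and naive every-other-sample decimation as a stand-in for a resampling layer. The helper `peak_frequencies` is hypothetical and is defined here only for illustration.

```python
import numpy as np

FS = 16_000                      # sample rate in Hz; Nyquist limit is 8 kHz
t = np.arange(FS) / FS           # 1 s of samples, so each FFT bin is exactly 1 Hz


def peak_frequencies(signal, fs, rel_threshold=1e-3):
    """Hypothetical helper: frequencies (Hz) whose rFFT magnitude exceeds
    rel_threshold times the largest spectral peak."""
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.arange(len(spectrum)) * fs / len(signal)
    return freqs[spectrum > rel_threshold * spectrum.max()]


# Harmonic generation: sin^3(x) = (3 sin x - sin 3x) / 4, so cubing a 1 kHz tone
# adds a 3 kHz harmonic, which still lies below the Nyquist limit.
low = np.sin(2 * np.pi * 1000 * t)
low_peaks = peak_frequencies(low ** 3, FS)

# Aliasing: cubing a 3 kHz tone should add 9 kHz, which exceeds the 8 kHz Nyquist
# limit and folds back to 16 - 9 = 7 kHz.
high = np.sin(2 * np.pi * 3000 * t)
high_peaks = peak_frequencies(high ** 3, FS)

# A genuine 7 kHz tone lands in exactly the same bin as the aliased component.
genuine_peaks = peak_frequencies(np.sin(2 * np.pi * 7000 * t), FS)

# Resampling: naive 2x decimation (keep every other sample) of a 6 kHz tone
# halves the Nyquist limit to 4 kHz, so the tone folds to 8 - 6 = 2 kHz.
decimated_peaks = peak_frequencies(np.sin(2 * np.pi * 6000 * t)[::2], FS // 2)

print("1 kHz cubed:", low_peaks)
print("3 kHz cubed:", high_peaks)
print("7 kHz tone:", genuine_peaks)
print("6 kHz decimated:", decimated_peaks)
```

Cubing the 1 kHz tone yields peaks at 1 kHz and 3 kHz, a clean harmonic. Cubing the 3 kHz tone yields peaks at 3 kHz and 7 kHz, and 7 kHz is not a harmonic of 3 kHz. From the magnitude spectrum alone, that component cannot be told apart from a genuine 7 kHz tone, which illustrates the indistinguishability described above.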
Related papers
- WaveCycleGAN2: Time-domain Neural Post-filter For Speech Waveform Generation (2019) · 0.00
- A Neural Vocoder With Hierarchical Generation Of Amplitude And Phase Spectra For Statistical Parametric Speech Synthesis (2019) · 10.74
- Vocos: Closing The Gap Between Time-domain And Fourier-based Neural Vocoders For High-quality Audio Synthesis (2023) · 6.10
- NeuralDPS: Neural Deterministic Plus Stochastic Model With Multiband Excitation For Noise-controllable Waveform Generation (2022) · 0.00
- Mathematical Vocoder Algorithm: Modified Spectral Inversion For Efficient Neural Speech Synthesis (2021) · 0.00
- WaveCycleGAN: Synthetic-to-natural Speech Waveform Conversion Using Cycle-consistent Adversarial Networks (2018) · 9.92
- UnivNet: A Neural Vocoder With Multi-resolution Spectrogram Discriminators For High-fidelity Waveform Generation (2021) · 14.80
- Puffin: Pitch-synchronous Neural Waveform Generation For Fullband Speech On Modest Devices (2022) · 3.58