Generative Speech Foundation Model Pretraining For High-quality Speech Extraction And Restoration
2024 · Pin-Jui Ku, Alexander H. Liu, Roman Korostik, et al.
Abstract
This paper proposes a generatively pretrained foundation model for high-quality speech restoration tasks. Because the model operates directly on complex-valued short-time Fourier transform (STFT) coefficients, it does not rely on a vocoder for time-domain signal reconstruction. Compared with the prior work SpeechFlow, this simplifies the synthesis process and removes the quality upper bound imposed by a mel-spectrogram vocoder. The proposed method is evaluated on multiple speech restoration tasks: speech denoising, bandwidth extension, codec artifact removal, and target speaker extraction. In all scenarios, finetuning our pretrained model yields superior performance over strong baselines. Notably, in the target speaker extraction task, our model outperforms existing systems, including those that leverage SSL-pretrained encoders such as WavLM. The code and the pretrained checkpoints are publicly available in the NVIDIA NeMo framework.
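To illustrate why working on complex STFT coefficients makes a vocoder unnecessary, the following is a minimal sketch in plain NumPy, not the paper's implementation. The helper names (`stft`, `istft`, `to_real_features`, `from_real_features`), the window and hop sizes, the 16 kHz sample rate, and the choice to stack real and imaginary parts as two channels are all assumptions made for illustration. The restoration model itself is replaced by an identity placeholder.

```python
import numpy as np

def stft(x, n_fft=512, hop=128):
    """Complex STFT with a periodic Hann window; shape (frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft + 1)[:-1]  # periodic Hann
    # Pad both ends so every original sample is covered by four overlapping frames.
    x = np.pad(x, (n_fft, n_fft))
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] for i in range(n_frames)])
    return np.fft.rfft(frames * window, axis=-1)

def istft(spec, n_fft=512, hop=128, length=None):
    """Inverse STFT by windowed overlap-add, normalized by the summed squared window."""
    window = np.hanning(n_fft + 1)[:-1]
    frames = np.fft.irfft(spec, n=n_fft, axis=-1) * window
    n_frames = spec.shape[0]
    out = np.zeros(n_fft + hop * (n_frames - 1))
    norm = np.zeros_like(out)
    for i in range(n_frames):
        out[i * hop : i * hop + n_fft] += frames[i]
        norm[i * hop : i * hop + n_fft] += window ** 2
    out = out / np.maximum(norm, 1e-8)
    out = out[n_fft:]  # drop the leading padding added in stft()
    return out[:length] if length is not None else out

def to_real_features(spec):
    """Hypothetical model input: real and imaginary parts as two channels."""
    return np.stack([spec.real, spec.imag], axis=-1)

def from_real_features(feat):
    """Rebuild complex coefficients from the two-channel representation."""
    return feat[..., 0] + 1j * feat[..., 1]

# Synthetic one-second signal at an assumed 16 kHz sample rate.
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
x = np.sin(2 * np.pi * 440.0 * t) + 0.1 * rng.standard_normal(t.shape)

spec = stft(x)
features = to_real_features(spec)

# Placeholder for a restoration model: identity mapping on the features.
restored_features = features

recon = istft(from_real_features(restored_features), length=len(x))
max_error = float(np.max(np.abs(recon - x)))
```

Because phase is kept alongside magnitude, the inverse STFT recovers the waveform up to floating-point error. A mel-spectrogram discards phase and compresses the frequency axis, so a separate vocoder is needed to get back to the time domain.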
Related papers
- Generative Pre-training For Speech With Flow Matching (2023) 0.00
- VoiceRestore: Flow-matching Transformers For Speech Recording Quality Restoration (2025) 0.00
- VoiceFixer: A Unified Framework For High-fidelity Speech Restoration (2022) 12.33
- Real-time Streamable Generative Speech Restoration With Flow Matching (2025) 0.00
- Improved Normalizing Flow-based Speech Enhancement Using An All-pole Gammatone Filterbank For Conditional Input Representation (2022) 0.00
- A Neural Denoising Vocoder For Clean Waveform Generation From Noisy Mel-spectrogram Based On Amplitude And Phase Predictions (2024) 0.00
- Enhancing Low-quality Voice Recordings Using Disentangled Channel Factor And Neural Waveform Model (2020) 0.00
- Speech Denoising By Parametric Resynthesis (2019) 7.16