Ultra-lightweight Neural Differential DSP Vocoder For High Quality Speech Synthesis
2024 · Prabhav Agrawal, Thilo Koehler, Zhiping Xiu, et al.
Abstract
Neural vocoders model the raw audio waveform and synthesize high-quality audio, but even highly efficient ones, such as MB-MelGAN and LPCNet, fail to run in real time on a low-end device such as smart glasses. A pure digital signal processing (DSP) based vocoder can be implemented with lightweight fast Fourier transforms (FFTs) and is therefore an order of magnitude faster than any neural vocoder. However, a DSP vocoder often yields lower audio quality because it consumes over-smoothed acoustic model predictions of approximate representations of the vocal tract. In this paper, we propose an ultra-lightweight differential DSP (DDSP) vocoder that jointly optimizes an acoustic model with a DSP vocoder and learns without an extracted spectral feature for the vocal tract. The model achieves audio quality comparable to neural vocoders, with a high average MOS of 4.36, while being as efficient as a DSP vocoder. Our C++ implementation, without any hardware-specific optimization, is at 15 MFLOPS, surpasses MB-MelGAN by 3
Related papers
- DSPGAN: A GAN-based Universal Vocoder For High-fidelity TTS By Time-frequency Domain Supervision From DSP (2022)
- Fast, High-quality And Parameter-efficient Articulatory Synthesis Using Differentiable DSP (2024)
- A Real-time Wideband Neural Vocoder At 1.6 kb/s Using LPCNet (2019)
- NeuralDPS: Neural Deterministic Plus Stochastic Model With Multiband Excitation For Noise-controllable Waveform Generation (2022)
- Puffin: Pitch-synchronous Neural Waveform Generation For Fullband Speech On Modest Devices (2022)
- VocGAN: A High-fidelity Real-time Vocoder With A Hierarchically-nested Adversarial Network (2020)
- Speech Synthesis And Control Using Differentiable DSP (2020)
- LA-VocE: Low-SNR Audio-visual Speech Enhancement Using Neural Vocoders (2022)