Beyond Oversmoothing: Evaluating DDPM And MSE For Scalable Speech Synthesis In ASR
2024 · Christoph Minixhofer, Ondrej Klejch, Peter Bell
Abstract
Synthetically generated speech has rapidly approached human levels of naturalness. However, the paradox remains that ASR systems, when trained on TTS output that is judged as natural by humans, continue to perform badly on real speech. In this work, we explore whether this phenomenon is due to the oversmoothing behaviour of models commonly used in TTS, with a particular focus on the behaviour of TTS-for-ASR as the amount of TTS training data is scaled up. We systematically compare Denoising Diffusion Probabilistic Models (DDPM) to Mean Squared Error (MSE) based models for TTS, when used for ASR model training. We test the scalability of the two approaches, varying both the number of hours and the number of different speakers. We find that for a given model size, DDPM can make better use of more data, and a more diverse set of speakers, than MSE models. We achieve the best reported ratio between real and synthetic speech WER to date (1.46), but also find that a large gap remains.
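The headline figure in the abstract is a ratio of word error rates. The sketch below shows one way such a ratio could be computed. It assumes the ratio is the WER of an ASR model trained on synthetic speech divided by the WER of the same model trained on real speech, which is consistent with a reported value above 1 but is not spelled out in the abstract. The function names `word_error_rate` and `wer_ratio` are hypothetical, and the numbers in the usage lines are made-up illustrative values, not results from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between hypothesis and reference,
    divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def wer_ratio(wer_synthetic: float, wer_real: float) -> float:
    """Assumed definition: WER after training on synthetic speech divided
    by WER after training on real speech. A value of 1.0 would mean no gap."""
    if wer_real <= 0:
        raise ValueError("wer_real must be positive")
    return wer_synthetic / wer_real


# Illustrative values only (not taken from the paper).
example_wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
example_ratio = wer_ratio(wer_synthetic=14.6, wer_real=10.0)
```

Under this assumed definition, a ratio of 1.46 would mean the synthetic-trained model makes 46% more word errors than the real-trained one on the same real test speech.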
Related papers
- Evaluating Text-to-speech Synthesis From A Large Discrete Token-based Speech Language Model (2024)
- Training Data Augmentation For Dysarthric Automatic Speech Recognition By Text-to-dysarthric-speech Synthesis (2024)
- DENT-DDSP: Data-efficient Noisy Speech Generator Using Differentiable Digital Signal Processors For Explicit Distortion Modelling And Noise-robust Speech Recognition (2022)
- On The Effect Of Purely Synthetic Training Data For Different Automatic Speech Recognition Architectures (2024)
- Effect Of Noise Suppression Losses On Speech Distortion And ASR Performance (2021)
- Comparing The Benefit Of Synthetic Training Data For Various Automatic Speech Recognition Architectures (2021)
- Dmospeech: Direct Metric Optimization Via Distilled Diffusion Model In Zero-shot Speech Synthesis (2024)
- BDDM: Bilateral Denoising Diffusion Models For Fast And High-quality Speech Synthesis (2022)