Zero Shot Text To Speech Augmentation For Automatic Speech Recognition On Low-resource Accented Speech Corpora
2024 · Francesco Nespoli, Daniel Barreda, Patrick A. Naylor
Abstract
In recent years, automatic speech recognition (ASR) models have greatly improved transcription performance both in clean, low-noise acoustic conditions and in reverberant environments. However, all these systems rely on the availability of hundreds of hours of labelled training data in specific acoustic conditions. When such a training dataset is not available, the performance of the system is heavily impacted. This happens, for example, when a specific acoustic environment or a particular population of speakers is under-represented in the training dataset. In this paper we investigate the effect of accented speech data on an off-the-shelf ASR system. Furthermore, we suggest a strategy based on zero-shot text-to-speech to augment accented speech corpora. We show that this augmentation method mitigates the loss in performance of the ASR system on accented data, with up to 5% word error rate reduction (WERR). In conclusion, we demonstrate that by incorporating a modest f
Related papers
- ASR Data Augmentation In Low-resource Settings Using Cross-lingual Multi-speaker TTS And Cross-lingual Voice Conversion (2022)
- Frustratingly Easy Data Augmentation For Low-resource ASR (2025)
- Improving Accented Speech Recognition Using Data Augmentation Based On Unsupervised Text-to-speech Synthesis (2024)
- Improving Low Resource Code-switched ASR Using Augmented Code-switched TTS (2020)
- Speech Synthesis As Augmentation For Low-resource ASR (2020)
- You Do Not Need More Data: Improving End-to-end Speech Recognition By Text-to-speech Data Augmentation (2020)
- Synthetic Cross-accent Data Augmentation For Automatic Speech Recognition (2023)
- Pretraining By Backtranslation For End-to-end ASR In Low-resource Settings (2018)