Improving Accented Speech Recognition Using Data Augmentation Based On Unsupervised Text-to-speech Synthesis
2024 · Cong-Thanh Do, Shuhei Imai, Rama Doddipatla, et al.
Abstract
This paper investigates the use of unsupervised text-to-speech synthesis (TTS) as a data augmentation method to improve accented speech recognition. The TTS systems are trained on a small amount of accented speech data paired with pseudo-labels rather than manual transcriptions, which is why the approach is called unsupervised. This makes it possible to use accented speech data that has no manual transcriptions for data augmentation in accented speech recognition. Synthetic accented speech, generated from text prompts by the TTS systems, is then combined with available non-accented speech data to train automatic speech recognition (ASR) systems. ASR experiments are performed in a self-supervised learning framework using a Wav2vec2.0 model pre-trained on a large amount of unsupervised accented speech data. The accented speech data for training the unsupervised TTS are read speech, selected from the L2-ARCTIC and British Isles corpora, while spontaneous conversational speech from the E
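The data flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Utterance` record and the functions `pseudo_label`, `synthesize`, and `build_asr_training_set` are hypothetical names, and the `recognize` and `tts` callables stand in for an existing ASR system and a TTS system trained on the pseudo-labelled accented speech.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Utterance:
    """Hypothetical record pairing an audio identifier with its text."""
    audio_id: str    # identifier of a waveform (real or synthetic)
    text: str        # manual transcription, pseudo-label, or text prompt
    synthetic: bool  # True if the audio was generated by the TTS system


def pseudo_label(audio_ids: Iterable[str],
                 recognize: Callable[[str], str]) -> List[Utterance]:
    """Label untranscribed accented audio with the output of an existing
    ASR system (assumption: `recognize` maps an audio id to text).
    The result is what the TTS system would be trained on."""
    return [Utterance(a, recognize(a), synthetic=False) for a in audio_ids]


def synthesize(prompts: Iterable[str],
               tts: Callable[[str], str]) -> List[Utterance]:
    """Generate synthetic accented speech from text prompts (assumption:
    `tts` maps a prompt to the id of the generated waveform)."""
    return [Utterance(tts(p), p, synthetic=True) for p in prompts]


def build_asr_training_set(non_accented: Iterable[Utterance],
                           synthetic_accented: Iterable[Utterance]
                           ) -> List[Utterance]:
    """Combine real non-accented speech with synthetic accented speech.
    The combined set would then be used to fine-tune the ASR model."""
    return list(non_accented) + list(synthetic_accented)
```

Note that in this sketch the pseudo-labelled accented speech only serves to train the TTS system; the ASR training set is built from the non-accented data plus the synthetic accented speech, matching the combination the abstract describes.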
Related papers
- You Do Not Need More Data: Improving End-to-end Speech Recognition By Text-to-speech Data Augmentation (2020)
- TTS-by-TTS: TTS-driven Data Augmentation For Fast And High-quality Speech Synthesis (2020)
- Generating Synthetic Audio Data For Attention-based Speech Recognition Systems (2019)
- Training Data Augmentation For Dysarthric Automatic Speech Recognition By Text-to-dysarthric-speech Synthesis (2024)
- Synthetic Cross-accent Data Augmentation For Automatic Speech Recognition (2023)
- Speech Recognition With Augmented Synthesized Speech (2019)
- Low-resource Expressive Text-to-speech Using Data Augmentation (2020)
- Zero Shot Text To Speech Augmentation For Automatic Speech Recognition On Low-resource Accented Speech Corpora (2024)