Comparing The Benefit Of Synthetic Training Data For Various Automatic Speech Recognition Architectures
2021 · Nick Rossenbach, Mohammad Zeineldeen, Benedikt Hilmes, et al.
Abstract
Recent publications on automatic speech recognition (ASR) focus strongly on attention encoder-decoder (AED) architectures, which tend to suffer from overfitting in low-resource scenarios. One way to tackle this issue is to generate synthetic data with a trained text-to-speech (TTS) system when additional text is available. This approach has been applied successfully in many publications with AED systems, but only to a very limited extent with other ASR architectures. We investigate how varying the pre-processing, the speaker embedding, and the input encoding of the TTS system affects the effectiveness of the synthesized data for AED-ASR training. Additionally, we consider internal language model subtraction for the first time, resulting in up to 38% relative improvement. We compare the AED results to a state-of-the-art hybrid ASR system, a monophone-based system using connectionist temporal classification (CTC), and a monotonic transducer-based system. We show that for the latter systems
Related papers
- On The Effect Of Purely Synthetic Training Data For Different Automatic Speech Recognition Architectures (2024) 0.00
- Generating Synthetic Audio Data For Attention-based Speech Recognition Systems (2019) 12.68
- On The Problem Of Text-to-speech Model Selection For Synthetic Data Generation In Automatic Speech Recognition (2024) 4.52
- On The Relevance Of Phoneme Duration Variability Of Synthesized Training Data For Automatic Speech Recognition (2023) 2.26
- You Do Not Need More Data: Improving End-to-end Speech Recognition By Text-to-speech Data Augmentation (2020) 11.49
- Intermediate Fine-tuning Using Imperfect Synthetic Speech For Improving Electrolaryngeal Speech Recognition (2022) 0.00
- Training Data Augmentation For Dysarthric Automatic Speech Recognition By Text-to-dysarthric-speech Synthesis (2024) 10.48
- Enhancing Synthetic Training Data For Speech Commands: From ASR-based Filtering To Domain Adaptation In SSL Latent Space (2024) 0.00