Back-translation-style Data Augmentation For End-to-end ASR
2018 · Tomoki Hayashi, Shinji Watanabe, Yu Zhang, et al.
Abstract
In this paper we propose a novel data augmentation method for attention-based end-to-end automatic speech recognition (E2E-ASR) that uses a large amount of text not paired with speech signals. Inspired by the back-translation technique proposed in machine translation, we build a neural text-to-encoder model that takes a sequence of characters and predicts the sequence of hidden states that a pre-trained E2E-ASR encoder would extract. Using hidden states as the target instead of acoustic features makes attention learning faster and reduces computational cost, thanks to sub-sampling in the E2E-ASR encoder. Unlike acoustic features, the hidden states also avoid the need to model speaker dependencies. After training, the text-to-encoder model generates hidden states from a large amount of unpaired text, and the E2E-ASR decoder is then retrained using the generated hidden states as additional training data. Experimental evaluation using the LibriSpeech dataset demon
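The data flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `Example`, `target_length`, `build_decoder_training_set`, and `toy_text_to_encoder` are hypothetical, the sub-sampling factor of 4 and the ceiling rounding are assumptions (the abstract states neither), and the toy text-to-encoder function merely stands in for a trained neural model.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


@dataclass
class Example:
    states: List[Vector]  # encoder hidden-state sequence fed to the decoder
    text: str             # character transcript the decoder should emit
    synthetic: bool       # True if states came from the text-to-encoder model


def target_length(num_frames: int, subsampling: int) -> int:
    """Hidden-state sequence length after encoder sub-sampling.

    Assumes ceiling division; a real encoder may round differently.
    """
    return -(-num_frames // subsampling)


def build_decoder_training_set(
    paired: Sequence[Tuple[List[Vector], str]],
    unpaired_text: Sequence[str],
    text_to_encoder: Callable[[str], List[Vector]],
) -> List[Example]:
    """Combine real encoder states with states generated from unpaired text."""
    real = [Example(states, text, False) for states, text in paired]
    generated = [Example(text_to_encoder(t), t, True) for t in unpaired_text]
    return real + generated


def toy_text_to_encoder(text: str, dim: int = 4) -> List[Vector]:
    """Hypothetical stand-in for the trained model: one vector per character."""
    return [[(ord(ch) % 7) / 7.0] * dim for ch in text]


if __name__ == "__main__":
    # With an assumed factor of 4, 1000 acoustic frames become 250 targets.
    print(target_length(1000, 4))
    paired = [([[0.1, 0.2, 0.3, 0.4]] * 3, "abc")]
    data = build_decoder_training_set(paired, ["hello", "hi"], toy_text_to_encoder)
    print([(ex.text, len(ex.states), ex.synthetic) for ex in data])
```

Under these assumptions, the text-to-encoder model has a shorter target sequence to predict than a model that generates acoustic frames directly, which is consistent with the cost reduction the abstract attributes to sub-sampling.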
Related papers
- Improving Code-switching And Named Entity Recognition In ASR With Speech Editing Based Data Augmentation (2023) · 6.34
- You Do Not Need More Data: Improving End-to-end Speech Recognition By Text-to-speech Data Augmentation (2020) · 11.49
- Generating Synthetic Audio Data For Attention-based Speech Recognition Systems (2019) · 12.68
- Data Augmentation Methods For End-to-end Speech Recognition On Distant-talk Scenarios (2021) · 6.34
- Pre-training End-to-end ASR Models With Augmented Speech Samples Queried By Text (2023) · 0.00
- Pretraining By Backtranslation For End-to-end ASR In Low-resource Settings (2018) · 0.00
- On-the-fly Aligned Data Augmentation For Sequence-to-sequence ASR (2021) · 9.23
- SkinAugment: Auto-encoding Speaker Conversions For Automatic Speech Translation (2020) · 7.16