Pretraining By Backtranslation For End-to-end ASR In Low-resource Settings
2018 · Matthew Wiesner, Adithya Renduchintala, Shinji Watanabe, et al.
Abstract
We explore training attention-based encoder-decoder ASR in low-resource settings. These models perform poorly when trained on small amounts of transcribed speech, in part because they depend on having sufficient target-side text to train the attention and decoder networks. In this paper we address this shortcoming by pretraining our network parameters using only text-based data and transcribed speech from other languages. We analyze the relative contributions of both sources of data. Across 3 test languages, our text-based approach resulted in a 20% average relative improvement over a text-based augmentation technique without pretraining. Using transcribed speech from nearby languages gives a further 20-30% relative reduction in character error rate.
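The abstract reports its gains as relative reductions in character error rate (CER). The sketch below shows how the two quantities are usually computed. It assumes the common definition of CER: character-level Levenshtein edits (substitutions, insertions, deletions) divided by the number of reference characters, pooled over the test set. The paper's exact scoring setup, such as text normalization, is not given here. The function names are illustrative and do not come from the paper.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings (unit-cost sub/ins/del)."""
    # prev[j] holds the distance between the previous prefix of ref and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,              # delete r from the reference
                curr[j - 1] + 1,          # insert h from the hypothesis
                prev[j - 1] + (r != h),   # substitute (free when chars match)
            )
        prev = curr
    return prev[-1]


def cer(refs: list[str], hyps: list[str]) -> float:
    """Corpus-level CER: total character edits / total reference characters."""
    edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total = sum(len(r) for r in refs)
    return edits / total


def relative_reduction(baseline: float, new: float) -> float:
    """Relative error reduction of `new` versus `baseline` (0.25 means 25%)."""
    return (baseline - new) / baseline


# Hypothetical numbers, not results from the paper:
print(cer(["abcd"], ["abed"]))         # one substitution over 4 chars
print(relative_reduction(40.0, 30.0))  # 40% CER -> 30% CER
```

A relative reduction is measured against the baseline error, so it can differ sharply from the absolute change. In the hypothetical case above, CER falls from 40% to 30%. That is 10 absolute points but a 25% relative reduction.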
Related papers
- Analyzing ASR Pretraining For Low-resource Speech-to-text Translation (2019) · 10.07
- Strategies For Improving Low Resource Speech To Text Translation Relying On Pre-trained ASR Models (2023) · 5.24
- Leveraging Translations For Speech Transcription In Low-resource Settings (2018) · 6.77
- Improving Cross-lingual Transfer Learning For End-to-end Speech Recognition With Speech Translation (2020) · 9.92
- Pre-training End-to-end ASR Models With Augmented Speech Samples Queried By Text (2023) · 0.00
- ASR Data Augmentation In Low-resource Settings Using Cross-lingual Multi-speaker TTS And Cross-lingual Voice Conversion (2022) · 6.77
- Back-translation-style Data Augmentation For End-to-end ASR (2018) · 13.11
- Frustratingly Easy Data Augmentation For Low-resource ASR (2025) · 0.00