Deep Contextualized Acoustic Representations For Semi-supervised Speech Recognition
2019 · Shaoshi Ling, Yuzong Liu, Julian Salazar, et al.
Abstract
We propose a novel approach to semi-supervised automatic speech recognition (ASR). We first exploit a large amount of unlabeled audio data via representation learning, where we reconstruct a temporal slice of filterbank features from past and future context frames. The resulting deep contextualized acoustic representations (DeCoAR) are then used to train a CTC-based end-to-end ASR system using a smaller amount of labeled audio data. In our experiments, we show that systems trained on DeCoAR consistently outperform ones trained on conventional filterbank features, giving 42% and 19% relative improvement over the baseline on WSJ eval92 and LibriSpeech test-clean, respectively. Our approach can drastically reduce the amount of labeled data required: unsupervised training on LibriSpeech followed by supervised training on 100 hours of labeled data achieves performance on par with training on all 960 hours directly. Pre-trained models and code will be released online.
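The reconstruction objective described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: the function names, the use of a mean absolute error, and the placeholder predictor that interpolates between the mean of the past frames and the mean of the future frames. A real system would replace the placeholder with learned encoders over the context frames.

```python
def mean_frame(frames):
    """Average a list of feature frames (each a list of floats)."""
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]


def interpolate_predictor(past, future, width):
    """Hypothetical stand-in for a learned model.

    Predicts each frame of the hidden slice by linearly interpolating
    between the mean past frame and the mean future frame.
    """
    left, right = mean_frame(past), mean_frame(future)
    predicted = []
    for k in range(width):
        w = (k + 1) / (width + 1)  # position of frame k inside the slice
        predicted.append([(1 - w) * l + w * r for l, r in zip(left, right)])
    return predicted


def slice_reconstruction_loss(frames, start, width, predict):
    """Mean absolute error between a hidden slice and its reconstruction.

    The predictor only sees frames before `start` (past context) and
    frames from `start + width` onward (future context).
    """
    past = frames[:start]
    future = frames[start + width:]
    if not past or not future:
        raise ValueError("slice needs both past and future context")
    target = frames[start:start + width]
    predicted = predict(past, future, width)
    total = sum(
        abs(p - t)
        for pred_frame, true_frame in zip(predicted, target)
        for p, t in zip(pred_frame, true_frame)
    )
    return total / (width * len(frames[0]))


if __name__ == "__main__":
    # Toy 2-dimensional "filterbank" frames; the values are made up.
    toy = [[0.0, 0.0], [0.0, 0.0], [1.0, 1.0],
           [2.0, 2.0], [3.0, 3.0], [3.0, 3.0]]
    print(slice_reconstruction_loss(toy, 2, 2, interpolate_predictor))
```

In this sketch no labels are involved: the training signal comes entirely from the audio features themselves, which is what allows the representation to be learned from unlabeled data before the CTC-based system is trained on the smaller labeled set.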
Related papers
- End-to-end ASR: From Supervised To Semi-supervised Learning With Modern Architectures (2019)
- Training ASR Models By Generation Of Contextual Information (2019)
- Semi-supervised Sequence-to-sequence ASR Using Unpaired Speech And Text (2019)
- Large Scale Weakly And Semi-supervised Learning For Low-resource Video ASR (2020)
- CTC-segmentation Of Large Corpora For German End-to-end Speech Recognition (2020)
- Deep Context: End-to-end Contextual Speech Recognition (2018)
- BigSSL: Exploring The Frontier Of Large-scale Semi-supervised Learning For Automatic Speech Recognition (2021)
- Joint Masked CPC And CTC Training For ASR (2020)