Training ASR Models By Generation Of Contextual Information
2019 · Kritika Singh, Dmytro Okhonko, Jun Liu, et al.
Abstract
Supervised ASR models have reached unprecedented levels of accuracy, thanks in part to ever-increasing amounts of labelled training data. However, in many applications and locales, only moderate amounts of data are available, which has led to a surge in semi- and weakly-supervised learning research. In this paper, we conduct a large-scale study evaluating the effectiveness of weakly-supervised learning for speech recognition by using loosely related contextual information as a surrogate for ground-truth labels. For weakly-supervised training, we use 50k hours of public English social media videos along with their respective titles and post text to train an encoder-decoder transformer model. Our best encoder-decoder models achieve an average of 20.8% WER reduction over a 1000-hour supervised baseline, and an average of 13.4% WER reduction when using only the weakly-supervised encoder for CTC fine-tuning. Our results show that our setup for weak supervision improved both the encoder aco
Related papers
- Large Scale Weakly And Semi-supervised Learning For Low-resource Video ASR (2020)
- Improving RNN-T ASR Accuracy Using Context Audio (2020)
- Deep Contextualized Acoustic Representations For Semi-supervised Speech Recognition (2019)
- From Weak Labels To Strong Results: Utilizing 5,000 Hours Of Noisy Classroom Transcripts With Minimal Accurate Data (2025)
- Leveraging Acoustic Contextual Representation By Audio-textual Cross-modal Learning For Conversational ASR (2022)
- Learning ASR-robust Contextualized Embeddings For Spoken Language Understanding (2019)
- A Comparison Of Semi-supervised Learning Techniques For Streaming ASR At Scale (2023)
- Improving RNN Transducer Based ASR With Auxiliary Tasks (2020)