Large Scale Weakly And Semi-supervised Learning For Low-resource Video ASR
2020 · Kritika Singh, Vimal Manohar, Alex Xiao, et al.
Abstract
Many semi- and weakly-supervised approaches have been investigated for overcoming the labeling cost of building high quality speech recognition systems. On the challenging task of transcribing social media videos in low-resource conditions, we conduct a large scale systematic comparison between two self-labeling methods on one hand, and weakly-supervised pretraining using contextual metadata on the other. We investigate distillation methods at the frame level and the sequence level for hybrid, encoder-only CTC-based, and encoder-decoder speech recognition systems on Dutch and Romanian languages using 27,000 and 58,000 hours of unlabeled audio respectively. Although all approaches improved upon their respective baseline WERs by more than 8%, sequence-level distillation for encoder-decoder models provided the largest relative WER reduction of 20% compared to the strongest data-augmented supervised baseline.
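The abstract reports its gains as relative WER reductions, meaning the improvement is expressed as a fraction of the baseline WER rather than as an absolute difference in percentage points. A minimal sketch of that arithmetic follows; the function name and the WER values are hypothetical illustrations, not figures from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the WER reduction as a percentage of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical WERs (in percent), chosen only to illustrate the formula;
# they are not results reported in the paper.
baseline = 25.0
distilled = 20.0
reduction = relative_wer_reduction(baseline, distilled)
print(f"Relative WER reduction: {reduction:.1f}%")
```

With these made-up numbers, a 5-point absolute drop from a 25% baseline corresponds to a 20% relative reduction.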
Related papers
- Training ASR Models By Generation Of Contextual Information (2019)
- From Weak Labels To Strong Results: Utilizing 5,000 Hours Of Noisy Classroom Transcripts With Minimal Accurate Data (2025)
- Deep Contextualized Acoustic Representations For Semi-supervised Speech Recognition (2019)
- A Comparison Of Semi-supervised Learning Techniques For Streaming ASR At Scale (2023)
- Improving Streaming Automatic Speech Recognition With Non-streaming Model Distillation On Unsupervised Data (2020)
- Improving Low-resource Speech Recognition With Pretrained Speech Models: Continued Pretraining Vs. Semi-supervised Training (2022)
- End-to-end ASR: From Supervised To Semi-supervised Learning With Modern Architectures (2019)
- Evaluating Standard And Dialectal Frisian ASR: Multilingual Fine-tuning And Language Identification For Improved Low-resource Performance (2025)