Learning From Flawed Data: Weakly Supervised Automatic Speech Recognition
2023 · Dongji Gao, Hainan Xu, Desh Raj, et al.
Abstract
Training automatic speech recognition (ASR) systems requires large amounts of well-curated paired data. However, human annotators usually perform "non-verbatim" transcription, which can result in poorly trained models. In this paper, we propose Omni-temporal Classification (OTC), a novel training criterion that explicitly incorporates label uncertainties originating from such weak supervision. This allows the model to effectively learn speech-text alignments while accommodating errors present in the training transcripts. OTC extends the conventional CTC objective for imperfect transcripts by leveraging weighted finite state transducers. Through experiments conducted on the LibriSpeech and LibriVox datasets, we demonstrate that training ASR models with OTC avoids performance degradation even with transcripts containing up to 70% errors, a scenario where CTC models fail completely. Our implementation is available at https://github.com/k2-fsa/icefall.
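For readers unfamiliar with the baseline that OTC extends, the following is a minimal sketch of the conventional CTC negative log-likelihood, computed with the standard forward recursion over a blank-interleaved label sequence. It is not the OTC criterion and is not taken from the icefall implementation. The function names (`logsumexp`, `ctc_neg_log_likelihood`) and the choice of token id 0 for the blank are assumptions made for illustration only.

```python
import math

NEG_INF = float("-inf")


def logsumexp(*xs):
    """Numerically stable log(sum(exp(x))) over the given values."""
    m = max(xs)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(x - m) for x in xs))


def ctc_neg_log_likelihood(log_probs, labels, blank=0):
    """Conventional CTC loss for one utterance (illustrative sketch).

    log_probs: list of T frames, each a list of per-token log-probabilities.
    labels:    list of token ids without blanks.
    blank:     id of the blank token (assumed to be 0 here).
    """
    # Interleave blanks: [a, b] -> [blank, a, blank, b, blank]
    ext = [blank]
    for y in labels:
        ext += [y, blank]
    S, T = len(ext), len(log_probs)

    # alpha[s] = log-probability of all partial alignments ending in ext[s]
    alpha = [NEG_INF] * S
    alpha[0] = log_probs[0][ext[0]]
    if S > 1:
        alpha[1] = log_probs[0][ext[1]]

    for t in range(1, T):
        new = [NEG_INF] * S
        for s in range(S):
            cands = [alpha[s]]                # stay on the same symbol
            if s >= 1:
                cands.append(alpha[s - 1])    # advance by one position
            # Skip over a blank only between two different non-blank labels
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append(alpha[s - 2])
            new[s] = logsumexp(*cands) + log_probs[t][ext[s]]
        alpha = new

    # Valid alignments end on the last label or on the trailing blank
    total = logsumexp(alpha[-1], alpha[-2]) if S > 1 else alpha[-1]
    return -total


# Two frames, uniform over {blank, "a"}; target transcript is the single label "a"
frames = [[math.log(0.5), math.log(0.5)]] * 2
loss = ctc_neg_log_likelihood(frames, [1])
```

This recursion treats every label in the transcript as something the audio must align to, so a wrong, missing, or extra word in the transcript is forced into the alignment. That is the behavior the abstract describes OTC as relaxing.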
Related papers
- Bypass Temporal Classification: Weakly Supervised Automatic Speech Recognition With Imperfect Transcripts (2023)
- Unsupervised Online Continual Learning For Automatic Speech Recognition (2024)
- From Weak Labels To Strong Results: Utilizing 5,000 Hours Of Noisy Classroom Transcripts With Minimal Accurate Data (2025)
- Joint Masked CPC And CTC Training For ASR (2020)
- Continual Learning For Monolingual End-to-end Automatic Speech Recognition (2021)
- Weakly-supervised Speech Pre-training: A Case Study On Target Speech Recognition (2023)
- Unpaired Speech Enhancement By Acoustic And Adversarial Supervision For Speech Recognition (2018)
- Training ASR Models By Generation Of Contextual Information (2019)