Benchmarking LF-MMI, CTC And RNN-T Criteria For Streaming ASR
2020 · Xiaohui Zhang, Frank Zhang, Chunxi Liu, et al.
Abstract
In this work, to measure accuracy and efficiency for a latency-controlled streaming automatic speech recognition (ASR) application, we perform comprehensive evaluations of three popular training criteria: LF-MMI, CTC and RNN-T. In transcribing social media videos in 7 languages, with 3K-14K hours of training data, we conduct large-scale controlled experiments across the criteria using identical datasets and encoder model architecture. We find that RNN-T consistently wins in ASR accuracy, while CTC models excel at inference efficiency. Moreover, we selectively examine various modeling strategies for the different training criteria, including modeling units, encoder architectures, and pre-training. To the best of our knowledge, this is the first comprehensive benchmark of these three widely used training criteria on a large-scale, real-world streaming ASR application across many languages.
Related papers
- Bridging The Gap Between Streaming And Non-streaming ASR Systems By Distilling Ensembles Of CTC And RNN-T Models (2021)
- Mask-CTC-based Encoder Pre-training For Streaming End-to-end Speech Recognition (2023)
- A Comparison Of Semi-supervised Learning Techniques For Streaming ASR At Scale (2023)
- An Investigation Of Monotonic Transducers For Large-scale Automatic Speech Recognition (2022)
- Open ASR Leaderboard: Towards Reproducible And Transparent Multilingual And Long-form Speech Recognition Evaluation (2025)
- One In A Hundred: Select The Best Predicted Sequence From Numerous Candidates For Streaming Speech Recognition (2020)
- CUSIDE-T: Chunking, Simulating Future And Decoding For Transducer Based Streaming ASR (2024)
- Improving RNN Transducer Based ASR With Auxiliary Tasks (2020)