Bridging The Gap Between Streaming And Non-streaming ASR Systems By Distilling Ensembles Of CTC And RNN-T Models
2021 · Thibault Doutre, Wei Han, Chung-Cheng Chiu, et al.
Abstract
Streaming end-to-end automatic speech recognition (ASR) systems are widely used in everyday applications that require transcribing speech to text in real time. Their minimal latency makes them suitable for such tasks. Unlike their non-streaming counterparts, streaming models are constrained to be causal with no future context and suffer from higher word error rates (WER). To improve streaming models, a recent study [1] proposed to distill a non-streaming teacher model on unsupervised utterances, and then train a streaming student using the teacher's predictions. However, the performance gap between teacher and student WERs remains high. In this paper, we aim to close this gap by using a diversified set of non-streaming teacher models and combining them using Recognizer Output Voting Error Reduction (ROVER). In particular, we show that, despite being weaker than RNN-T models, CTC models are remarkable teachers. Further, by fusing RNN-T and CTC models together, we build the strongest tea
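To make the ROVER step mentioned in the abstract concrete, here is a minimal sketch of the voting idea only. It is not the authors' implementation: full ROVER first aligns the hypotheses into a word transition network by dynamic programming and can weight votes by confidence, whereas this sketch assumes the teacher transcripts are already aligned slot by slot and padded with a placeholder. The names `NULL` and `vote` are hypothetical.

```python
from collections import Counter

# Hypothetical placeholder meaning "this teacher emitted no word in this slot".
NULL = "<eps>"


def vote(aligned_hyps):
    """Majority vote per slot over pre-aligned teacher hypotheses.

    Assumption: every hypothesis is a list of tokens of the same length,
    already aligned (the alignment step of ROVER is not shown here).
    """
    if not aligned_hyps:
        return []
    length = len(aligned_hyps[0])
    if any(len(h) != length for h in aligned_hyps):
        raise ValueError("hypotheses must be pre-aligned to the same length")

    result = []
    for slot in zip(*aligned_hyps):
        counts = Counter(slot)
        # max() returns the first maximal key in insertion order, so on a
        # tie the earliest-listed teacher's token wins.
        winner = max(counts, key=counts.get)
        if winner != NULL:  # drop slots where "no word" wins the vote
            result.append(winner)
    return result


teachers = [
    ["the", "cat", "sat", NULL],
    ["the", "cat", "sat", "down"],
    ["a", "cat", "sat", "down"],
]
combined = vote(teachers)
```

In a distillation setup like the one described above, the combined transcript would serve as the pseudo-label for training the streaming student on an unlabeled utterance.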
Related papers
- Improving Streaming Automatic Speech Recognition With Non-streaming Model Distillation On Unsupervised Data (2020) · 0.00
- An Investigation Of Enhancing CTC Model For Triggered Attention-based Streaming ASR (2021) · 0.00
- Improving RNN Transducer Modeling For End-to-end Speech Recognition (2019) · 0.00
- Cascaded Encoders For Unifying Streaming And Non-streaming ASR (2020) · 12.47
- Exploring Architectures, Data And Units For Streaming End-to-end Speech Recognition With RNN-transducer (2018) · 16.21
- One In A Hundred: Select The Best Predicted Sequence From Numerous Candidates For Streaming Speech Recognition (2020) · 0.00
- Mask-CTC-based Encoder Pre-training For Streaming End-to-end Speech Recognition (2023) · 0.00
- Unified Streaming And Non-streaming Two-pass End-to-end Model For Speech Recognition (2020) · 0.00