Cascaded Encoders For Unifying Streaming And Non-streaming ASR
2020 · Arun Narayanan, Tara N. Sainath, Ruoming Pang, et al.
Abstract
End-to-end (E2E) automatic speech recognition (ASR) models have by now shown competitive performance on several benchmarks. These models are structured to operate in either streaming or non-streaming mode. This work presents cascaded encoders for building a single E2E ASR model that can operate in both these modes simultaneously. The proposed model consists of streaming and non-streaming encoders. Input features are first processed by the streaming encoder; the non-streaming encoder operates exclusively on the output of the streaming encoder. A single decoder then learns to decode using the output of either the streaming or the non-streaming encoder. Results show that this model achieves word error rates (WER) similar to those of a standalone streaming model when operating in streaming mode, and obtains a 10% to 27% relative improvement when operating in non-streaming mode. Our results also show that the proposed approach outperforms existing E2E two-pass models, especially on long-form speech.
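The data flow described in the abstract can be illustrated with a toy sketch. This is not the authors' model: the running mean, the windowed mean, and the argmax "decoder" below are hypothetical stand-ins chosen only to show the cascade (input features, then the streaming encoder, then optionally the non-streaming encoder, then one shared decoder). All function names and the `context` parameter are assumptions made for illustration.

```python
from typing import List

Frame = List[float]


def streaming_encoder(frames: List[Frame]) -> List[Frame]:
    """Toy causal encoder (assumption): output t is the mean of frames 0..t."""
    out: List[Frame] = []
    running = [0.0] * len(frames[0]) if frames else []
    for t, frame in enumerate(frames, start=1):
        running = [r + x for r, x in zip(running, frame)]
        out.append([r / t for r in running])
    return out


def non_streaming_encoder(stream_out: List[Frame], context: int = 2) -> List[Frame]:
    """Toy non-causal encoder (assumption): sees only the streaming encoder's
    output, but averages over past AND future frames within `context`."""
    n = len(stream_out)
    out: List[Frame] = []
    for t in range(n):
        window = stream_out[max(0, t - context): min(n, t + context + 1)]
        dim = len(window[0])
        out.append([sum(f[d] for f in window) / len(window) for d in range(dim)])
    return out


def shared_decoder(encoded: List[Frame]) -> List[int]:
    """Toy shared decoder (assumption): per-frame argmax, usable on either
    encoder's output because both have the same shape."""
    return [max(range(len(f)), key=f.__getitem__) for f in encoded]


def recognize(frames: List[Frame], mode: str = "streaming") -> List[int]:
    """Run the cascade in the requested mode with the same decoder."""
    stream_out = streaming_encoder(frames)
    if mode == "streaming":
        return shared_decoder(stream_out)
    if mode == "non_streaming":
        return shared_decoder(non_streaming_encoder(stream_out))
    raise ValueError(f"unknown mode: {mode}")
```

In this sketch the streaming path never looks at future frames, while the non-streaming path reuses the streaming encoder's output and adds future context on top of it, which mirrors the cascade structure the abstract describes.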
Related papers
- Joint Optimization Of Streaming And Non-streaming Automatic Speech Recognition With Multi-decoder And Knowledge Distillation (2024)
- Unified Streaming And Non-streaming Two-pass End-to-end Model For Speech Recognition (2020)
- Streaming Parallel Transducer Beam Search With Fast-slow Cascaded Encoders (2022)
- Multi-stream End-to-end Speech Recognition (2019)
- On Comparison Of Encoders For Attention Based End To End Speech Recognition In Standalone And Rescoring Mode (2022)
- Toward Streaming ASR With Non-autoregressive Insertion-based Model (2020)
- Large-scale Multilingual Speech Recognition With A Streaming End-to-end Model (2019)
- Bridging The Gap Between Streaming And Non-streaming ASR Systems By Distilling Ensembles Of CTC And RNN-T Models (2021)