Joint Optimization Of Streaming And Non-streaming Automatic Speech Recognition With Multi-decoder And Knowledge Distillation
2024 · Muhammad Shakeel, Yui Sudo, Yifan Peng, et al.
Abstract
End-to-end (E2E) automatic speech recognition (ASR) can operate in two modes, streaming and non-streaming, each with its pros and cons. Streaming ASR processes speech frames in real time as they are received, while non-streaming ASR waits for the entire utterance; thus, professionals may have to operate in one mode or the other to suit their application. In this work, we present joint optimization of streaming and non-streaming ASR based on multi-decoder and knowledge distillation. Primarily, we study 1) integrating the encoders of these ASR modules, 2) using separate decoders to make mode switching flexible, and 3) enhancing performance by incorporating similarity-preserving knowledge distillation between the two modular encoders and decoders. Evaluation results show 2.6%-5.3% relative character error rate reductions (CERR) on CSJ for streaming ASR, and 8.3%-9.7% relative CERRs for non-streaming ASR within a single model compared to multiple standalone modules.
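As a rough illustration of the similarity-preserving knowledge distillation mentioned in the abstract, the sketch below computes the commonly used form of that loss: build a pairwise similarity (Gram) matrix over a batch for each of two modules, L2-normalize each row, and penalize the squared difference between the two matrices. The abstract does not say which module acts as teacher, at which layers the loss is applied, or how features are pooled, so the function names `similarity_matrix` and `sp_kd_loss`, the teacher/student roles, and the flattened per-utterance features are all assumptions made for illustration, not the authors' implementation.

```python
import math


def similarity_matrix(features):
    """Row-normalized pairwise similarity for a batch of flattened feature vectors.

    `features` is a list of equal-length lists, one per utterance in the batch
    (hypothetical input format chosen for this sketch).
    """
    # Gram matrix: dot product between every pair of batch items.
    gram = [[sum(a * b for a, b in zip(fi, fj)) for fj in features]
            for fi in features]
    normalized = []
    for row in gram:
        # L2-normalize each row; fall back to 1.0 to avoid dividing by zero.
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        normalized.append([v / norm for v in row])
    return normalized


def sp_kd_loss(teacher, student):
    """Mean squared difference between teacher and student similarity matrices.

    The two inputs may have different feature dimensions, because only the
    batch-by-batch similarity structure is compared.
    """
    if len(teacher) != len(student):
        raise ValueError("teacher and student must share the batch size")
    b = len(teacher)
    g_t = similarity_matrix(teacher)
    g_s = similarity_matrix(student)
    return sum((g_t[i][j] - g_s[i][j]) ** 2
               for i in range(b) for j in range(b)) / (b * b)


# Toy batch of three utterances; values are made up for illustration only.
teacher_feats = [[1.0, 0.0, 2.0], [0.5, 1.5, 0.0], [2.0, 1.0, 1.0]]
student_feats = [[0.9, 0.1], [0.2, 1.1], [1.0, 1.0]]
loss = sp_kd_loss(teacher_feats, student_feats)
```

Because only the batch-level similarity structure is matched, the two modules do not need the same feature dimensionality, which is one reason a loss of this kind could be attached to both an encoder pair and a decoder pair. In a joint training setup such a term would typically be added to the recognition losses with a weighting factor; that weighting is likewise an assumption here.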
Related papers
- Cascaded Encoders For Unifying Streaming And Non-streaming ASR (2020) 12.47
- Reducing The Gap Between Streaming And Non-streaming Transducer-based ASR By Adaptive Two-stage Knowledge Distillation (2023) 4.52
- Knowledge Distillation From Non-streaming To Streaming ASR Encoder Using Auxiliary Non-streaming Layer (2023) 0.00
- Multi-stream End-to-end Speech Recognition (2019) 8.35
- Improving Streaming Automatic Speech Recognition With Non-streaming Model Distillation On Unsupervised Data (2020) 0.00
- Alignment Knowledge Distillation For Online Streaming Attention-based Speech Recognition (2021) 7.16
- Unified Streaming And Non-streaming Two-pass End-to-end Model For Speech Recognition (2020) 0.00
- 4D ASR: Joint Modeling Of CTC, Attention, Transducer, And Mask-predict Decoders (2022) 7.50