Knowledge Distillation From Non-streaming To Streaming ASR Encoder Using Auxiliary Non-streaming Layer
2023 · Kyuhong Shim, Jinkyu Lee, Simyung Chang, et al.
Abstract
Streaming automatic speech recognition (ASR) models are restricted from accessing future context, which results in worse performance than non-streaming models. To improve the performance of streaming ASR, knowledge distillation (KD) from a non-streaming to a streaming model has been studied, mainly focusing on aligning the output token probabilities. In this paper, we propose a layer-to-layer KD from the teacher encoder to the student encoder. To ensure that features are extracted using the same context, we insert auxiliary non-streaming branches into the student and perform KD from the non-streaming teacher layer to the non-streaming auxiliary layer. We design a special KD loss that leverages the autoregressive predictive coding (APC) mechanism to encourage the streaming model to predict unseen future contexts. Experimental results show that the proposed method can significantly reduce the word error rate compared to previous token probability distillation methods.
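The two loss ideas in the abstract can be illustrated with a minimal sketch. The abstract does not give the loss formulas, so everything here is an assumption made for illustration: mean squared error as the distance, the function names `layer_kd_loss` and `future_prediction_loss`, the `shift` parameter, and the weighting of the two terms. It is not the authors' implementation.

```python
from typing import List

Frames = List[List[float]]  # indexed as [time][feature_dim]


def mse(a: Frames, b: Frames) -> float:
    """Mean squared error between two equally shaped frame sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and equally long")
    total, count = 0.0, 0
    for frame_a, frame_b in zip(a, b):
        if len(frame_a) != len(frame_b):
            raise ValueError("feature dimensions must match")
        for x, y in zip(frame_a, frame_b):
            total += (x - y) ** 2
            count += 1
    return total / count


def layer_kd_loss(teacher: Frames, student_aux: Frames) -> float:
    """Match a teacher layer against the student's auxiliary non-streaming layer.

    Assumption: plain MSE stands in for whatever distance the paper uses.
    """
    return mse(teacher, student_aux)


def future_prediction_loss(teacher: Frames, student_pred: Frames, shift: int) -> float:
    """APC-style term: the prediction at frame t is scored against the teacher at t + shift.

    Assumption: `shift` (hypothetical name) is the number of future frames to look ahead.
    """
    if shift < 1 or shift >= len(teacher) or len(teacher) != len(student_pred):
        raise ValueError("need equal lengths and 1 <= shift < number of frames")
    return mse(teacher[shift:], student_pred[:-shift])


# Toy features: 3 frames, 2 dimensions each.
teacher = [[0.0, 1.0], [1.0, 2.0], [2.0, 3.0]]
student_aux = [[0.0, 1.0], [1.0, 2.0], [2.0, 5.0]]
student_pred = [[1.0, 2.0], [2.0, 3.0], [9.0, 9.0]]  # last frame has no future target

kd = layer_kd_loss(teacher, student_aux)
apc = future_prediction_loss(teacher, student_pred, shift=1)
total = kd + 0.5 * apc  # the 0.5 weight is an arbitrary illustrative choice
```

In the toy data, the auxiliary features differ from the teacher in a single value, so `kd` is small but non-zero, while the shifted predictions match the teacher's next frames exactly, so `apc` is zero.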
Related papers
- Reducing The Gap Between Streaming And Non-streaming Transducer-based ASR By Adaptive Two-stage Knowledge Distillation (2023)
- Joint Optimization Of Streaming And Non-streaming Automatic Speech Recognition With Multi-decoder And Knowledge Distillation (2024)
- Improving Streaming Automatic Speech Recognition With Non-streaming Model Distillation On Unsupervised Data (2020)
- Leave No Knowledge Behind During Knowledge Distillation: Towards Practical And Effective Knowledge Distillation For Code-switching ASR Using Realistic Data (2024)
- Sequence-level Knowledge Distillation For Class-incremental End-to-end Spoken Language Understanding (2023)
- Inter-KD: Intermediate Knowledge Distillation For CTC-based Automatic Speech Recognition (2022)
- Bridging The Gap Between Streaming And Non-streaming ASR Systems By Distilling Ensembles Of CTC And RNN-T Models (2021)
- Synergistic Effects Of Knowledge Distillation And Structured Pruning For Self-supervised Speech Models (2025)