Dynamic Latency For CTC-based Streaming Automatic Speech Recognition With Emformer
2022 · Jingyu Sun, Guiping Zhong, Dinghao Zhou, et al.
Abstract
Streaming automatic speech recognition (ASR) models often perform worse than non-streaming models because they lack future context. To improve the performance of the streaming model and reduce its computational complexity, this paper employs a frame-level model that combines efficient augmented memory transformer (Emformer) blocks with a dynamic latency training method. Long-range history context is stored in an augmented memory bank to complement the limited history context used in the encoder. Keys and values are cached and reused for the next chunk to reduce computation. A dynamic latency training method is then proposed to obtain better performance and to support low- and high-latency inference simultaneously. Experiments are conducted on the benchmark 960h LibriSpeech dataset. With an average latency of 640 ms, the model achieves a relative WER reduction of 6.0% on test-clean and 3.0% on tes
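The key/value caching idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class name `CachedChunkAttention`, the `left_context` parameter, the single attention head, and the plain-list arithmetic are all assumptions made for illustration, and the augmented memory bank is left out entirely. The sketch only shows that each frame's key and value are projected once and then reused when the next chunk arrives, and that the chunk size may differ from call to call.

```python
import math


def project(frame, weight):
    """Multiply one feature vector by a weight matrix given as a list of rows."""
    return [sum(w * x for w, x in zip(row, frame)) for row in weight]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class CachedChunkAttention:
    """Single-head chunk-wise self-attention with a key/value cache.

    Hypothetical illustration only; names and parameters are not from the paper.
    """

    def __init__(self, w_q, w_k, w_v, left_context):
        self.w_q, self.w_k, self.w_v = w_q, w_k, w_v
        self.left_context = left_context  # past frames kept in the cache (assumed knob)
        self.key_cache = []
        self.value_cache = []
        self.projections_computed = 0  # counts frames whose key/value were projected

    def forward(self, chunk):
        # Project keys and values for the new frames only; history comes from the cache.
        new_keys = [project(frame, self.w_k) for frame in chunk]
        new_values = [project(frame, self.w_v) for frame in chunk]
        self.projections_computed += len(chunk)

        keys = self.key_cache + new_keys
        values = self.value_cache + new_values
        scale = math.sqrt(len(self.w_q))  # square root of the query/key dimension

        outputs = []
        for frame in chunk:
            query = project(frame, self.w_q)
            scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
            weights = softmax(scores)
            dim = len(values[0])
            outputs.append(
                [sum(w * value[d] for w, value in zip(weights, values)) for d in range(dim)]
            )

        # Keep only the most recent left_context frames for the next chunk.
        if self.left_context > 0:
            self.key_cache = keys[-self.left_context:]
            self.value_cache = values[-self.left_context:]
        else:
            self.key_cache, self.value_cache = [], []
        return outputs


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    frames = [[0.1 * i, 1.0 - 0.1 * i] for i in range(6)]
    layer = CachedChunkAttention(identity, identity, identity, left_context=2)
    layer.forward(frames[0:2])  # small chunk: lower latency
    layer.forward(frames[2:6])  # larger chunk: higher latency
    print(layer.projections_computed)  # each of the 6 frames is projected once
```

Without the cache, the second call would have to project the two history frames again before attending to them. Varying the chunk size between calls is only a loose analogy to low- and high-latency inference; how the dynamic latency training itself works is not specified in the abstract and is not modelled here.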
Related papers
- Mask-CTC-based Encoder Pre-training For Streaming End-to-end Speech Recognition (2023) 0.00
- Dynamic Latency Speech Recognition With Asynchronous Revision (2020) 0.00
- Stateful Conformer With Cache-based Inference For Streaming Automatic Speech Recognition (2023) 8.60
- DCTX-Conformer: Dynamic Context Carry-over For Low Latency Unified Streaming And Non-streaming Conformer ASR (2023) 2.26
- Minimum Latency Training Of Sequence Transducers For Streaming End-to-end Speech Recognition (2022) 0.00
- Streaming Attention-based Models With Augmented Memory For End-to-end Speech Recognition (2020) 5.84
- Improving Streaming Speech Recognition With Time-shifted Contextual Attention And Dynamic Right Context Masking (2025) 2.26
- Streaming Simultaneous Speech Translation With Augmented Memory Transformer (2020) 6.77