Unified Streaming And Non-streaming Two-pass End-to-end Model For Speech Recognition
2020 · Binbin Zhang, Di Wu, Zhuoyuan Yao, et al.
Abstract
In this paper, we present a novel two-pass approach to unify streaming and non-streaming end-to-end (E2E) speech recognition in a single model. Our model adopts the hybrid CTC/attention architecture, in which the conformer layers in the encoder are modified. We propose a dynamic chunk-based attention strategy to allow arbitrary right context length. At inference time, the CTC decoder generates n-best hypotheses in a streaming way. The inference latency can be easily controlled by changing only the chunk size. The CTC hypotheses are then rescored by the attention decoder to get the final result. This efficient rescoring process causes very little sentence-level latency. Our experiments on the open 170-hour AISHELL-1 dataset show that the proposed method can unify the streaming and non-streaming model simply and efficiently. On the AISHELL-1 test set, our unified model achieves 5.60% relative character error rate (CER) reduction in non-streaming ASR compared to a standard non-streamin
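The chunk-based attention described in the abstract can be illustrated with a small mask-building sketch. The abstract does not give the exact masking rule, so this assumes that each frame attends to every earlier frame plus the frames in its own chunk; the function name `chunk_attention_mask` and its parameters are hypothetical, not taken from the paper.

```python
def chunk_attention_mask(num_frames: int, chunk_size: int) -> list[list[bool]]:
    """Build a boolean attention mask for chunk-based self-attention.

    mask[i][j] is True when frame i may attend to frame j.
    Assumption (not stated in the abstract): full left context, and
    right context limited to the end of the chunk containing frame i.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    mask = []
    for i in range(num_frames):
        # Exclusive end index of the chunk that contains frame i.
        chunk_end = min((i // chunk_size + 1) * chunk_size, num_frames)
        mask.append([j < chunk_end for j in range(num_frames)])
    return mask


if __name__ == "__main__":
    # Print the mask for 4 frames split into chunks of 2 frames.
    for row in chunk_attention_mask(num_frames=4, chunk_size=2):
        print("".join("x" if allowed else "." for allowed in row))
```

Under this assumption, a chunk size of 1 gives a causal (lower-triangular) mask, and a chunk size at least as long as the utterance gives full attention, which is how a single model could cover both the streaming and non-streaming cases by varying only the chunk size.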
Related papers
- Enhancing The Unified Streaming And Non-streaming Model With Contrastive Learning (2023) · 3.58
- Multi-stream End-to-end Speech Recognition (2019) · 8.35
- Cascaded Encoders For Unifying Streaming And Non-streaming ASR (2020) · 12.47
- U2++: Unified Two-pass Bidirectional End-to-end Model For Speech Recognition (2021) · 0.00
- Fast-U2++: Fast And Accurate End-to-end Speech Recognition In Joint CTC/attention Frames (2022) · 5.84
- Online Hybrid CTC/attention End-to-end Automatic Speech Recognition Architecture (2023) · 12.99
- Dynamic Chunk Convolution For Unified Streaming And Non-streaming Conformer ASR (2023) · 6.77
- Unified End-to-end Speech Recognition And Endpointing For Fast And Efficient Speech Systems (2022) · 5.24