CUSIDE-T: Chunking, Simulating Future And Decoding For Transducer Based Streaming ASR
2024 · Wenbo Zhao, Ziwei Li, Chuan Yu, et al.
Abstract
Streaming automatic speech recognition (ASR) is important for many real-world ASR applications. However, a notable challenge for streaming ASR systems lies in balancing operational performance against latency constraints. Recently, a method of chunking, simulating future context and decoding, called CUSIDE, has been proposed for connectionist temporal classification (CTC) based streaming ASR; it achieves a good balance between low latency and high recognition accuracy. In this paper, we present CUSIDE-T, which adapts the CUSIDE method to the recurrent neural network transducer (RNN-T) ASR architecture instead of the CTC architecture. We also incorporate language model rescoring in CUSIDE-T to further improve accuracy at the cost of only a small additional latency. Extensive experiments are conducted on the AISHELL-1, WenetSpeech and SpeechIO datasets, comparing CUSIDE-T and U2++ (both based on RNN-T). U2++ is an existing counterpart of chunk
Related papers
- CUSIDE: Chunking, Simulating Future Context And Decoding For Streaming ASR (2022)
- CUSIDE-array: A Streaming Multi-channel End-to-end Speech Recognition System With Realistic Evaluations (2024)
- An Investigation Of Enhancing CTC Model For Triggered Attention-based Streaming ASR (2021)
- Dynamic Chunk Convolution For Unified Streaming And Non-streaming Conformer ASR (2023)
- SSCFormer: Push The Limit Of Chunk-wise Conformer For Streaming ASR Using Sequentially Sampled Chunks And Chunked Causal Convolution (2022)
- Lookahead When It Matters: Adaptive Non-causal Transformers For Streaming Neural Transducers (2023)
- Mask-CTC-based Encoder Pre-training For Streaming End-to-end Speech Recognition (2023)
- Bridging The Gap Between Streaming And Non-streaming ASR Systems By Distilling Ensembles Of CTC And RNN-T Models (2021)