CUSIDE: Chunking, Simulating Future Context And Decoding For Streaming ASR
2022 · Keyu An, Huahuan Zheng, Zhijian Ou, et al.
Abstract
History and future contextual information are known to be important for accurate acoustic modeling. However, acquiring future context introduces latency in streaming ASR. In this paper, we propose a new framework for streaming speech recognition: Chunking, Simulating Future Context and Decoding (CUSIDE). A new simulation module is introduced to recursively simulate the future contextual frames, without waiting for the real future context. The simulation module is jointly trained with the ASR model using a self-supervised loss; the ASR model is optimized with the usual ASR loss, e.g., CTC-CRF as used in our experiments. Experiments show that, compared to using real future frames as right context, using simulated future context can drastically reduce latency while maintaining recognition accuracy. With CUSIDE, we obtain new state-of-the-art streaming ASR results on the AISHELL-1 dataset.
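The chunking idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the chunk and context sizes, and the frame representation are all hypothetical, and `simulate_future` is a trivial stand-in (it repeats the last frame of the chunk) for the learned simulation module, whose architecture and self-supervised loss are not described here. The sketch only shows how each chunk might be assembled from real history, the current chunk, and simulated right context, so that no real future frames need to be awaited.

```python
from typing import Callable, Iterator, List, Tuple

Frame = List[float]  # one acoustic feature vector (hypothetical representation)


def simulate_future(chunk: List[Frame], n_future: int) -> List[Frame]:
    """Hypothetical stand-in for a learned simulation module.

    Repeats the last frame of the chunk n_future times. A trained
    network would predict the future frames instead.
    """
    return [list(chunk[-1]) for _ in range(n_future)]


def chunk_with_context(
    frames: List[Frame],
    chunk_size: int,
    left: int,
    right: int,
    simulator: Callable[[List[Frame], int], List[Frame]] = simulate_future,
) -> Iterator[Tuple[int, List[Frame]]]:
    """Yield (start_index, model_input) for each chunk of the utterance.

    model_input = real left context + current chunk + simulated right context.
    """
    for start in range(0, len(frames), chunk_size):
        chunk = frames[start:start + chunk_size]
        history = frames[max(0, start - left):start]  # real past frames
        future = simulator(chunk, right)              # simulated, no waiting
        yield start, history + chunk + future


# Toy utterance: 10 one-dimensional frames with values 0.0 .. 9.0.
frames = [[float(t)] for t in range(10)]
inputs = list(chunk_with_context(frames, chunk_size=4, left=2, right=2))
for start, model_input in inputs:
    print(start, len(model_input))
```

In this toy setup the only latency from right context is the cost of running the simulator, since the model input for a chunk is complete as soon as the chunk itself has arrived.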
Related papers
- CUSIDE-T: Chunking, Simulating Future And Decoding For Transducer Based Streaming ASR (2024)
- Cuside-array: A Streaming Multi-channel End-to-end Speech Recognition System With Realistic Evaluations (2024)
- Sscformer: Push The Limit Of Chunk-wise Conformer For Streaming ASR Using Sequentially Sampled Chunks And Chunked Causal Convolution (2022)
- Dynamic Chunk Convolution For Unified Streaming And Non-streaming Conformer ASR (2023)
- Improving Streaming Speech Recognition With Time-shifted Contextual Attention And Dynamic Right Context Masking (2025)
- Boundary And Context Aware Training For Cif-based Non-autoregressive End-to-end ASR (2021)
- Lookahead When It Matters: Adaptive Non-causal Transformers For Streaming Neural Transducers (2023)
- Streaming Parallel Transducer Beam Search With Fast-slow Cascaded Encoders (2022)