Improving Streaming Speech Recognition With Time-shifted Contextual Attention And Dynamic Right Context Masking
2025 Β· Khanh Le, Duc Chau
Abstract
Chunk-based inference stands out as a popular approach in developing real-time streaming speech recognition, valued for its simplicity and efficiency. However, because it restricts the model's focus to only the history and current chunk context, it may result in performance degradation in scenarios that demand consideration of future context. Addressing this, we propose a novel approach featuring Time-Shifted Contextual Attention (TSCA) and Dynamic Right Context (DRC) masking. Our method shows a relative word error rate reduction of 10 to 13.9% on the Librispeech dataset with the inclusion of in-context future information provided by TSCA. Moreover, we present a streaming automatic speech recognition pipeline that facilitates the integration of TSCA with minimal user-perceived latency, while also enabling batch processing capability, making it practical for various applications.
Authors
(none)
Tags
Stats
Related papers
- Mask-ctc-based Encoder Pre-training For Streaming End-to-end Speech Recognition (2023)0.00
- Dynamic Latency For Ctc-based Streaming Automatic Speech Recognition With Emformer (2022)0.00
- An Investigation Of Enhancing CTC Model For Triggered Attention-based Streaming ASR (2021)0.00
- Contextualized Streaming End-to-end Speech Recognition With Trie-based Deep Biasing And Shallow Fusion (2021)13.44
- Dynamic Chunk Convolution For Unified Streaming And Non-streaming Conformer ASR (2023)6.77
- Adaptive Contextual Biasing For Transducer Based Streaming Speech Recognition (2023)7.16
- High Performance Sequence-to-sequence Model For Streaming Speech Recognition (2020)3.58
- Streaming Chunk-aware Multihead Attention For Online End-to-end Speech Recognition (2020)8.60