DCTX-Conformer: Dynamic Context Carry-over For Low Latency Unified Streaming And Non-streaming Conformer ASR
2023 · Goeric Huybrechts, Srikanth Ronanki, Xilai Li, et al.
Abstract
Conformer-based end-to-end models have become ubiquitous and are commonly used in both streaming and non-streaming automatic speech recognition (ASR). Techniques like dual-mode and dynamic chunk training have helped unify streaming and non-streaming systems. However, there remains a performance gap between streaming with full past context and streaming with limited past context. To address this issue, we propose the integration of a novel dynamic contextual carry-over mechanism in a state-of-the-art (SOTA) unified ASR system. Our proposed dynamic context Conformer (DCTX-Conformer) utilizes a non-overlapping contextual carry-over mechanism that takes into account both the left context of a chunk and one or more preceding context embeddings. We outperform the SOTA by 25.0% relative word error rate, with a negligible latency impact due to the additional context embeddings.
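To make the two sources of past information in the abstract concrete, here is a minimal sketch, not the authors' implementation. It assumes fixed-size, non-overlapping chunks and a limited left context measured in whole chunks. The names `chunk_attention_mask`, `carried_context_slots`, `left_chunks`, and `num_ctx` are hypothetical, and the rule that a chunk receives one context embedding from each of the `num_ctx` immediately preceding chunks is an assumption for illustration only.

```python
def chunk_attention_mask(num_frames, chunk_size, left_chunks):
    """Return mask[i][j], True when frame i may attend to frame j.

    A frame sees every frame in its own chunk plus the frames of up to
    `left_chunks` preceding chunks (limited left context, no lookahead
    beyond the current chunk).
    """
    mask = []
    for i in range(num_frames):
        chunk_i = i // chunk_size
        row = []
        for j in range(num_frames):
            chunk_j = j // chunk_size
            row.append(chunk_i - left_chunks <= chunk_j <= chunk_i)
        mask.append(row)
    return mask


def carried_context_slots(chunk_index, num_ctx):
    """Indices of the preceding chunks whose context embedding is carried over.

    Assumption: one embedding per preceding chunk, at most `num_ctx` of them.
    """
    start = max(0, chunk_index - num_ctx)
    return list(range(start, chunk_index))


mask = chunk_attention_mask(num_frames=6, chunk_size=2, left_chunks=1)
slots = carried_context_slots(chunk_index=3, num_ctx=2)
```

With `chunk_size=2` and `left_chunks=1`, frame 4 (in chunk 2) may attend to frames 2 through 5 but not to frames 0 and 1. Under the stated assumption, information from earlier chunks would reach it only through the carried-over context embeddings.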
Related papers
- Dynamic Chunk Convolution For Unified Streaming And Non-streaming Conformer ASR (2023)
- Towards Effective And Compact Contextual Representation For Conformer Transducer Speech Recognition Systems (2023)
- SSCFormer: Push The Limit Of Chunk-wise Conformer For Streaming ASR Using Sequentially Sampled Chunks And Chunked Causal Convolution (2022)
- Dynamic Latency For CTC-based Streaming Automatic Speech Recognition With Emformer (2022)
- Stateful Conformer With Cache-based Inference For Streaming Automatic Speech Recognition (2023)
- DualVC 3: Leveraging Language Model Generated Pseudo Context For End-to-end Low Latency Streaming Voice Conversion (2024)
- Unified Streaming And Non-streaming Two-pass End-to-end Model For Speech Recognition (2020)
- Advanced Long-context End-to-end Speech Recognition Using Context-expanded Transformers (2021)