SSCFormer: Push The Limit Of Chunk-wise Conformer For Streaming ASR Using Sequentially Sampled Chunks And Chunked Causal Convolution
2022 · Fangyuan Wang, Bo Xu, Bo Xu
Abstract
Chunk-wise schemes are currently a common way to make Automatic Speech Recognition (ASR) models support streaming deployment. However, existing approaches cannot capture the global context, lack support for parallel training, or exhibit quadratic complexity when computing multi-head self-attention (MHSA). Meanwhile, causal convolution, which uses no future context, has become the de facto module in the streaming Conformer. In this paper, we propose SSCFormer to push the limit of the chunk-wise Conformer for streaming ASR using the following two techniques: 1) a novel cross-chunk context generation method, named the Sequential Sampling Chunk (SSC) scheme, which re-partitions regularly partitioned chunks to facilitate efficient long-term contextual interaction within local chunks; 2) the Chunked Causal Convolution (C2Conv), which is designed to capture the left context and the chunk-wise future context concurrently. Evaluations on AISHELL-1 show that an End-to-End (E2E) CER 5.3
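To make the idea of "left context plus chunk-wise future context" concrete, the following is a minimal single-channel sketch, not the authors' implementation. It assumes that a frame may look back across chunk boundaries but may look ahead only up to the end of its own chunk; the function name `chunked_causal_conv1d`, the odd symmetric kernel, and the NumPy loop formulation are all illustrative assumptions that the abstract does not specify.

```python
import numpy as np


def chunked_causal_conv1d(x, kernel, chunk_size):
    """Hypothetical sketch of a chunk-limited 1-D convolution over time.

    x:          array of shape (T,), one feature channel
    kernel:     array of shape (K,), K odd (assumed symmetric window)
    chunk_size: number of frames per chunk

    Frame t may use up to (K - 1) // 2 frames on its left, even if they
    belong to earlier chunks, and up to (K - 1) // 2 frames on its right,
    but never a frame beyond the end of the chunk that contains t.
    """
    x = np.asarray(x, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    if kernel.ndim != 1 or kernel.size % 2 == 0:
        raise ValueError("kernel must be 1-D with odd length")
    if chunk_size < 1:
        raise ValueError("chunk_size must be positive")

    half = kernel.size // 2
    num_frames = x.shape[0]
    y = np.zeros(num_frames)
    for t in range(num_frames):
        # Exclusive index of the last frame visible to t (end of its chunk).
        chunk_end = min((t // chunk_size + 1) * chunk_size, num_frames)
        for k in range(-half, half + 1):
            s = t + k
            # Positions before the utterance or past the chunk count as zero.
            if 0 <= s < chunk_end:
                y[t] += kernel[k + half] * x[s]
    return y


# Toy input: 8 frames, chunks of 4 frames, a width-3 kernel of ones.
x = np.ones(8)
y = chunked_causal_conv1d(x, np.ones(3), chunk_size=4)
print(y)  # [2. 3. 3. 2. 3. 3. 3. 2.]
```

In this toy run, frames 3 and 7 are the last frames of their chunks, so they receive no right-hand context, while frame 4 still sees frame 3 from the previous chunk. A fully causal convolution would instead drop all right-hand context at every frame.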
Related papers
- Dynamic Chunk Convolution For Unified Streaming And Non-streaming Conformer ASR (2023)
- DCTX-Conformer: Dynamic Context Carry-over For Low Latency Unified Streaming And Non-streaming Conformer ASR (2023)
- CUSIDE: Chunking, Simulating Future Context And Decoding For Streaming ASR (2022)
- Streaming Chunk-aware Multihead Attention For Online End-to-end Speech Recognition (2020)
- CUSIDE-T: Chunking, Simulating Future And Decoding For Transducer Based Streaming ASR (2024)
- Stateful Conformer With Cache-based Inference For Streaming Automatic Speech Recognition (2023)
- Streaming Parallel Transducer Beam Search With Fast-slow Cascaded Encoders (2022)
- Towards Effective And Compact Contextual Representation For Conformer Transducer Speech Recognition Systems (2023)