Stateful Conformer With Cache-based Inference For Streaming Automatic Speech Recognition
2023 · Vahid Noroozi, Somshubra Majumdar, Ankur Kumar, et al.
Abstract
In this paper, we propose an efficient and accurate streaming speech recognition model based on the FastConformer architecture. We adapt the FastConformer architecture for streaming applications by (1) constraining both the look-ahead and past contexts in the encoder, and (2) introducing an activation caching mechanism that enables the non-autoregressive encoder to operate autoregressively during inference. The proposed model is designed to eliminate the accuracy disparity between training and inference that is common in many streaming models. Furthermore, the proposed encoder works with various decoder configurations, including Connectionist Temporal Classification (CTC) and RNN-Transducer (RNNT) decoders. Additionally, we introduce a hybrid CTC/RNNT architecture that uses a shared encoder with both a CTC and an RNNT decoder to boost accuracy and save computation. We evaluate the proposed model on the LibriSpeech dataset and a multi-domain large sc
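The activation caching idea in point (2) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a single causal 1D convolution as a stand-in for one encoder layer, and the names `causal_conv`, `CachedCausalConv`, and `stream` are hypothetical. It shows that when each chunk is processed together with the cached tail of the previous inputs, the chunk-by-chunk output matches the full-sequence output exactly.

```python
def causal_conv(signal, kernel):
    """Full-sequence (offline) causal convolution.

    output[t] depends only on signal[t - k + 1 .. t], with zeros on the left.
    """
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(signal)
    return [
        sum(kernel[j] * padded[t + j] for j in range(k))
        for t in range(len(signal))
    ]


class CachedCausalConv:
    """Streaming version: carries the last k - 1 inputs between chunks."""

    def __init__(self, kernel):
        self.kernel = list(kernel)
        # The initial cache plays the role of the offline left zero-padding.
        self.cache = [0.0] * (len(kernel) - 1)

    def step(self, chunk):
        k = len(self.kernel)
        padded = self.cache + list(chunk)
        out = [
            sum(self.kernel[j] * padded[t + j] for j in range(k))
            for t in range(len(chunk))
        ]
        # Keep only the trailing k - 1 inputs for the next chunk.
        self.cache = padded[len(padded) - (k - 1):]
        return out


def stream(signal, kernel, chunk_size):
    """Feed the signal chunk by chunk through the cached layer."""
    layer = CachedCausalConv(kernel)
    out = []
    for start in range(0, len(signal), chunk_size):
        out.extend(layer.step(signal[start:start + chunk_size]))
    return out


if __name__ == "__main__":
    signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
    kernel = [0.5, 0.25, 1.0]
    # Streaming and offline outputs agree for any chunk size.
    print(stream(signal, kernel, 3) == causal_conv(signal, kernel))
```

In this toy setting the cache holds raw inputs of one layer. In a multi-layer encoder each layer would presumably keep its own cache of intermediate activations, which is the part this sketch leaves out.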
Related papers
- Fast Conformer With Linearly Scalable Attention For Efficient Speech Recognition (2023) 14.47
- Dynamic Latency For CTC-based Streaming Automatic Speech Recognition With Emformer (2022) 0.00
- Dynamic Chunk Convolution For Unified Streaming And Non-streaming Conformer ASR (2023) 6.77
- Efficient Conformer: Progressive Downsampling And Grouped Attention For Automatic Speech Recognition (2021) 13.79
- DCTX-Conformer: Dynamic Context Carry-over For Low Latency Unified Streaming And Non-streaming Conformer ASR (2023) 2.26
- Unified Streaming And Non-streaming Two-pass End-to-end Model For Speech Recognition (2020) 0.00
- Adding Connectionist Temporal Summarization Into Conformer To Improve Its Decoder Efficiency For Speech Recognition (2022) 0.00
- Streaming Parallel Transducer Beam Search With Fast-slow Cascaded Encoders (2022) 0.00