Chunked Attention-based Encoder-decoder Model For Streaming Speech Recognition
2023 · Mohammad Zeineldeen, Albert Zeyer, Ralf Schlüter, et al.
Abstract
We study a streamable attention-based encoder-decoder model in which either the decoder, or both the encoder and decoder, operate on pre-defined, fixed-size windows called chunks. A special end-of-chunk (EOC) symbol advances from one chunk to the next chunk, effectively replacing the conventional end-of-sequence symbol. This modification, while minor, situates our model as equivalent to a transducer model that operates on chunks instead of frames, where EOC corresponds to the blank symbol. We further explore the remaining differences between a standard transducer and our model. Additionally, we examine relevant aspects such as long-form speech generalization, beam size, and length normalization. Through experiments on Librispeech and TED-LIUM-v2, and by concatenating consecutive sequences for long-form trials, we find that our streamable model maintains competitive performance compared to the non-streamable variant and generalizes very well to long-form speech.
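The chunk-wise decoding described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the non-overlapping chunking, the `max_labels_per_chunk` cap, and the toy step function (a stand-in for the neural decoder) are all assumptions made for illustration. It only shows the control flow in which EOC advances the decoder to the next chunk and decoding ends once the last chunk has been closed, so no separate end-of-sequence symbol is needed.

```python
from typing import Callable, List, Optional, Sequence

EOC = "<eoc>"  # hypothetical spelling of the end-of-chunk symbol


def split_into_chunks(frames: Sequence, chunk_size: int) -> List[Sequence]:
    """Split frames into consecutive fixed-size chunks (the last may be shorter)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [frames[i:i + chunk_size] for i in range(0, len(frames), chunk_size)]


def greedy_chunked_decode(
    frames: Sequence,
    chunk_size: int,
    step_fn: Callable[[Sequence, List[str]], str],
    max_labels_per_chunk: int = 20,  # assumed safety cap, not from the paper
) -> List[str]:
    """Greedy decoding where the decoder only sees the current chunk.

    step_fn(chunk, history) returns the next symbol. Emitting EOC moves on
    to the next chunk; decoding stops after the last chunk is closed.
    """
    history: List[str] = []  # includes EOC symbols, like a blank-augmented path
    for chunk in split_into_chunks(frames, chunk_size):
        for _ in range(max_labels_per_chunk):
            symbol = step_fn(chunk, history)
            history.append(symbol)
            if symbol == EOC:
                break
        else:
            history.append(EOC)  # cap reached: force the move to the next chunk
    return [s for s in history if s != EOC]


def toy_step(chunk: Sequence[Optional[str]], history: List[str]) -> str:
    """Toy stand-in for the decoder: emit the chunk's non-None frames, then EOC."""
    emitted = 0
    for symbol in reversed(history):  # count labels since the last EOC
        if symbol == EOC:
            break
        emitted += 1
    pending = [f for f in chunk if f is not None]
    return pending[emitted] if emitted < len(pending) else EOC


frames = ["a", None, None, "b", "c", None, None, None, "d"]
labels = greedy_chunked_decode(frames, chunk_size=4, step_fn=toy_step)
```

In this toy setup each chunk contributes zero or more labels followed by exactly one EOC, which mirrors the role the abstract assigns to EOC as the chunk-level counterpart of the transducer blank symbol.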
Related papers
- Unified Streaming And Non-streaming Two-pass End-to-end Model For Speech Recognition (2020)
- Dynamic Chunk Convolution For Unified Streaming And Non-streaming Conformer ASR (2023)
- Streaming Decoder-only Automatic Speech Recognition With Discrete Speech Units: A Pilot Study (2024)
- High Performance Sequence-to-sequence Model For Streaming Speech Recognition (2020)
- Streaming Chunk-aware Multihead Attention For Online End-to-end Speech Recognition (2020)
- Transformer Transducer: A Streamable Speech Recognition Model With Transformer Encoders And RNN-T Loss (2020)
- Cascaded Encoders For Unifying Streaming And Non-streaming ASR (2020)
- Streaming Attention-based Models With Augmented Memory For End-to-end Speech Recognition (2020)