Streaming Attention-based Models With Augmented Memory For End-to-end Speech Recognition
2020 · Ching-Feng Yeh, Yongqiang Wang, Yangyang Shi, et al.
Abstract
Attention-based models have been gaining popularity recently for their strong performance in fields such as machine translation and automatic speech recognition. One major challenge of attention-based models is the need for access to the full sequence and a computational cost that grows quadratically with the sequence length. These characteristics pose challenges, especially for low-latency scenarios, where the system is often required to be streaming. In this paper, we build a compact and streaming speech recognition system on top of the end-to-end neural transducer architecture with attention-based modules augmented with convolution. The proposed system equips the end-to-end models with the streaming capability and reduces the large footprint from the streaming attention-based model using augmented memory. On the LibriSpeech dataset, our proposed system achieves word error rates of 2.7% on test-clean and 5.8% on test-other, to the best of our knowledge the lowest among streami
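The quadratic cost mentioned in the abstract can be made concrete with a small counting sketch. It is an illustration only, not the paper's method: it assumes a simplified segment-wise scheme in which each frame attends to its own segment plus one memory slot per earlier segment, and it ignores any left or right context frames. The function names and the segment size are hypothetical.

```python
def full_attention_pairs(num_frames: int) -> int:
    """Query-key pairs scored when every frame attends to the whole sequence."""
    return num_frames * num_frames


def segment_attention_pairs(num_frames: int, segment_size: int) -> int:
    """Query-key pairs under a simplified segment-wise scheme (assumption):
    each frame attends to the frames of its own segment plus one memory
    slot summarizing each earlier segment."""
    total = 0
    num_prev_segments = 0
    for start in range(0, num_frames, segment_size):
        seg_len = min(segment_size, num_frames - start)
        # Keys visible to this segment: its own frames plus the memory slots.
        total += seg_len * (seg_len + num_prev_segments)
        num_prev_segments += 1
    return total


if __name__ == "__main__":
    for n in (1_000, 2_000, 4_000):
        print(n, full_attention_pairs(n), segment_attention_pairs(n, 100))
```

Doubling the number of frames quadruples the full-attention count, while the segment-wise count grows far more slowly because each segment sees only a bounded local window plus a short memory.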
Related papers
- Streaming Transformer-based Acoustic Models Using Self-attention With Augmented Memory (2020)
- Streaming Simultaneous Speech Translation With Augmented Memory Transformer (2020)
- An Online Attention-based Model For Speech Recognition (2018)
- Unified Streaming And Non-streaming Two-pass End-to-end Model For Speech Recognition (2020)
- High Performance Sequence-to-sequence Model For Streaming Speech Recognition (2020)
- Dynamic Latency For CTC-based Streaming Automatic Speech Recognition With Emformer (2022)
- Multi-stream End-to-end Speech Recognition (2019)
- Streaming Chunk-aware Multihead Attention For Online End-to-end Speech Recognition (2020)