Streaming Chunk-aware Multihead Attention For Online End-to-end Speech Recognition
2020 · Shiliang Zhang, Zhifu Gao, Haoneng Luo, et al.
Abstract
Streaming end-to-end automatic speech recognition (E2E-ASR) has recently attracted growing attention, and much effort has gone into turning non-streaming attention-based E2E-ASR systems into streaming architectures. In this work, we propose a novel online E2E-ASR system that uses Streaming Chunk-Aware Multihead Attention (SCAMA) and a latency control memory equipped self-attention network (LC-SAN-M). LC-SAN-M uses chunk-level input to control the latency of the encoder. In SCAMA, a jointly trained predictor is used to control the encoder output that is fed to the decoder, which enables the decoder to generate output in a streaming manner. Experimental results on the open 170-hour AISHELL-1 task and an industrial-level 20000-hour Mandarin speech recognition task show that our approach significantly outperforms the MoChA-based baseline system under a comparable setup. On the AISHELL-1 task, our proposed method achieves a character error rate (CER) of 7.39%, to the best of our knowledge, which is
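For readers unfamiliar with the metric quoted above, CER is conventionally computed as the character-level edit distance between the reference and the hypothesis, divided by the reference length. The following is a minimal sketch of that conventional definition only; the function names `edit_distance` and `cer` are illustrative, and the abstract does not say which scoring tool or text normalization the authors used.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum substitutions, deletions, and insertions."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # delete r from the reference
                cur[j - 1] + 1,          # insert h into the reference
                prev[j - 1] + (r != h),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # One substituted character out of four reference characters.
    print(cer("abcd", "abed"))
```

Under this definition, a CER of 7.39% corresponds to roughly 7.4 character errors per 100 reference characters. Because insertions count as errors, CER can exceed 100% when the hypothesis is much longer than the reference.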
Related papers
- An Online Attention-based Model For Speech Recognition (2018) 9.59
- Multi-stream End-to-end Speech Recognition (2019) 8.35
- Stream Attention-based Multi-array End-to-end Speech Recognition (2018) 0.00
- Online Hybrid CTC/attention End-to-end Automatic Speech Recognition Architecture (2023) 12.99
- Unified Streaming And Non-streaming Two-pass End-to-end Model For Speech Recognition (2020) 0.00
- Alignment Knowledge Distillation For Online Streaming Attention-based Speech Recognition (2021) 7.16
- Transformer-based Online CTC/attention End-to-end Speech Recognition Architecture (2020) 14.06
- Streaming Audio-visual Speech Recognition With Alignment Regularization (2022) 3.58