Towards Online End-to-end Transformer Automatic Speech Recognition
2019 · Emiru Tsunoo, Yosuke Kashiwagi, Toshiyuki Kumakura, et al.
Abstract
The Transformer self-attention network has recently shown promising performance as an alternative to recurrent neural networks in end-to-end (E2E) automatic speech recognition (ASR) systems. However, the Transformer has a drawback: the entire input sequence is required to compute self-attention. We previously proposed a block processing method for the Transformer encoder that introduces a context-aware inheritance mechanism. An additional context embedding vector handed over from the previously processed block helps to encode not only local acoustic information but also global linguistic, channel, and speaker attributes. In this paper, we extend that method towards a fully online E2E ASR system by introducing an online decoding process, inspired by monotonic chunkwise attention (MoChA), into the Transformer decoder. Our novel MoChA training and inference algorithms exploit the unique properties of Transformer, whose attentions are not always monotonic or peaky, and have multiple heads and residual
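The block processing idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes a single attention head with identity projections, a zero initial context vector, and a single layer, and the names `self_attention` and `encode_blocks` are hypothetical. It only shows how a context vector handed over from the previous block lets each block be encoded without seeing future frames.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(x):
    # Single-head scaled dot-product self-attention.
    # Assumption: identity query/key/value projections, for brevity.
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ x


def encode_blocks(frames, block_size):
    # Encode `frames` (shape [T, d]) one block at a time.
    d = frames.shape[1]
    context = np.zeros(d)  # assumption: zero initial context vector
    outputs = []
    for start in range(0, len(frames), block_size):
        block = frames[start:start + block_size]
        # Append the inherited context vector as one extra position.
        augmented = np.vstack([block, context[None, :]])
        attended = self_attention(augmented)
        outputs.append(attended[:-1])  # encoded frames of this block
        context = attended[-1]         # handed over to the next block
    return np.vstack(outputs)


rng = np.random.default_rng(0)
frames = rng.standard_normal((10, 4))
encoded = encode_blocks(frames, block_size=4)
print(encoded.shape)
```

Because each block attends only to its own frames plus the inherited context vector, the output for a block never depends on later frames, which is the property an online encoder needs.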
Related papers
- Transformer-based Online CTC/attention End-to-end Speech Recognition Architecture (2020)
- Transformer ASR With Contextual Block Processing (2019)
- Transformer-based Online Speech Recognition With Decoder-end Adaptive Computation Steps (2020)
- Improving Transformer-based Conversational ASR By Inter-sentential Attention Mechanism (2022)
- Online Hybrid CTC/attention End-to-end Automatic Speech Recognition Architecture (2023)
- Attention-based ASR With Lightweight And Dynamic Convolutions (2019)
- Transformer-transducer: End-to-end Speech Recognition With Self-attention (2019)
- Conv-transformer Transducer: Low Latency, Low Frame Rate, Streamable End-to-end Speech Recognition (2020)