Alignment Knowledge Distillation For Online Streaming Attention-based Speech Recognition
2021 · Hirofumi Inaguma, Tatsuya Kawahara
Abstract
This article describes an efficient training method for online streaming attention-based encoder-decoder (AED) automatic speech recognition (ASR) systems. AED models have achieved competitive performance in offline scenarios by jointly optimizing all components. They have recently been extended to an online streaming framework via models such as monotonic chunkwise attention (MoChA). However, the elaborate attention calculation process is not robust for long-form speech utterances. Moreover, the sequence-level training objective and time-restricted streaming encoder cause a nonnegligible delay in token emission during inference. To address these problems, we propose CTC synchronous training (CTC-ST), in which CTC alignments are leveraged as a reference for token boundaries to enable a MoChA model to learn optimal monotonic input-output alignments. We formulate a purely end-to-end training objective to synchronize the boundaries of MoChA to those of CTC. The CTC model shares an encoder
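The boundary synchronization idea described in the abstract can be illustrated with a small sketch. It is not the authors' implementation. Everything below is an assumption made for illustration: the function names, the blank index of 0, the choice of the first frame of each token run as its CTC boundary, the use of the attention-weighted mean frame index as the MoChA boundary, and the mean absolute distance as the penalty.

```python
from typing import List, Sequence

BLANK = 0  # assumed index of the CTC blank symbol


def ctc_token_boundaries(frame_labels: Sequence[int], blank: int = BLANK) -> List[int]:
    """Return the first frame of each token in a frame-level CTC alignment.

    Consecutive repeats of the same label count as one token; a blank
    between two identical labels separates them (standard CTC collapsing).
    """
    boundaries = []
    prev = blank
    for t, label in enumerate(frame_labels):
        if label != blank and label != prev:
            boundaries.append(t)
        prev = label
    return boundaries


def expected_boundaries(alpha: Sequence[Sequence[float]]) -> List[float]:
    """Expected frame index per output token, given attention weights.

    alpha[i][j] is the (hypothetical) probability that token i is emitted
    at encoder frame j.
    """
    return [sum(j * p for j, p in enumerate(row)) for row in alpha]


def sync_loss(ctc_bounds: Sequence[float], mocha_bounds: Sequence[float]) -> float:
    """Mean absolute distance between the two sets of token boundaries."""
    if len(ctc_bounds) != len(mocha_bounds):
        raise ValueError("both alignments must cover the same number of tokens")
    if not ctc_bounds:
        return 0.0
    total = sum(abs(c - m) for c, m in zip(ctc_bounds, mocha_bounds))
    return total / len(ctc_bounds)


# Toy example: 8 encoder frames, 3 output tokens.
frame_labels = [0, 3, 3, 0, 0, 5, 0, 5]
alpha = [
    [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.5, 0.0, 0.5],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
]
ctc_bounds = ctc_token_boundaries(frame_labels)
mocha_bounds = expected_boundaries(alpha)
loss = sync_loss(ctc_bounds, mocha_bounds)
```

In this toy case the attention-based boundaries trail the CTC boundaries by one frame for the first two tokens, so the penalty is nonzero. Minimizing a term of this kind alongside the usual recognition loss would pull the attention boundaries toward the CTC reference.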
Related papers
- Online Hybrid CTC/attention End-to-end Automatic Speech Recognition Architecture (2023)
- StableEmit: Selection Probability Discount For Reducing Emission Latency Of Streaming Monotonic Attention ASR (2021)
- Streaming End-to-end Speech Recognition With Jointly Trained Neural Feature Enhancement (2021)
- Mask-CTC-based Encoder Pre-training For Streaming End-to-end Speech Recognition (2023)
- An Investigation Of Enhancing CTC Model For Triggered Attention-based Streaming ASR (2021)
- Unified Streaming And Non-streaming Two-pass End-to-end Model For Speech Recognition (2020)
- Streaming Chunk-aware Multihead Attention For Online End-to-end Speech Recognition (2020)
- Streaming Audio-visual Speech Recognition With Alignment Regularization (2022)