StableEmit: Selection Probability Discount for Reducing Emission Latency of Streaming Monotonic Attention ASR
2021 · Hirofumi Inaguma, Tatsuya Kawahara
Abstract
While attention-based encoder-decoder (AED) models have been successfully extended to the online variants for streaming automatic speech recognition (ASR), such as monotonic chunkwise attention (MoChA), the models still have a large label emission latency because of the unconstrained end-to-end training objective. Previous works tackled this problem by leveraging alignment information to control the timing to emit tokens during training. In this work, we propose a simple alignment-free regularization method, StableEmit, to encourage MoChA to emit tokens earlier. StableEmit discounts the selection probabilities in hard monotonic attention for token boundary detection by a constant factor and regularizes them to recover the total attention mass during training. As a result, the scale of the selection probabilities is increased, and the values can reach a threshold for token emission earlier, leading to a reduction of emission latency and deletion errors. Moreover, StableEmit can be combi
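The abstract gives no equations, so the following is only a minimal sketch of the idea it describes, not the authors' implementation. It assumes the standard expected-alignment recurrence for hard monotonic attention, a hypothetical constant `discount` that scales the selection probabilities, and a hypothetical absolute-difference penalty that pushes the total attention mass back toward the number of output tokens. All function and parameter names are illustrative.

```python
def expected_alignment(p, prev_alpha):
    """Expected alignment for one output step of hard monotonic attention.

    Assumed recurrence: alpha_j = p_j * q_j,
    with q_j = (1 - p_{j-1}) * q_{j-1} + prev_alpha_j.
    """
    alpha = []
    q = 0.0
    p_prev = 0.0
    for p_j, a_prev in zip(p, prev_alpha):
        q = (1.0 - p_prev) * q + a_prev
        alpha.append(p_j * q)
        p_prev = p_j
    return alpha


def attention_mass(p_rows, discount=0.0):
    """Total expected attention mass over all output steps.

    p_rows[i][j] is the selection probability for output step i, frame j.
    Each probability is scaled by (1 - discount) before the recurrence
    (hypothetical form of the constant-factor discount).
    """
    num_frames = len(p_rows[0])
    prev = [1.0] + [0.0] * (num_frames - 1)  # attention starts at the first frame
    total = 0.0
    for p in p_rows:
        discounted = [(1.0 - discount) * x for x in p]
        prev = expected_alignment(discounted, prev)
        total += sum(prev)
    return total


def mass_recovery_penalty(p_rows, discount):
    """Hypothetical regularizer: gap between token count and attention mass."""
    return abs(len(p_rows) - attention_mass(p_rows, discount))


# Two output tokens, four encoder frames, uniform selection probabilities.
p_rows = [[0.5, 0.5, 0.5, 0.5], [0.5, 0.5, 0.5, 0.5]]
plain = attention_mass(p_rows, discount=0.0)
discounted = attention_mass(p_rows, discount=0.2)
penalty = mass_recovery_penalty(p_rows, discount=0.2)
```

In this sketch the discount lowers the attention mass, so minimizing the penalty during training can only be satisfied by raising the underlying selection probabilities. That matches the abstract's claim that the scale of the selection probabilities increases and the emission threshold is reached earlier.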
Related papers
- FastEmit: Low-latency Streaming ASR With Sequence-level Emission Regularization (2020) 12.61
- Alignment Knowledge Distillation For Online Streaming Attention-based Speech Recognition (2021) 7.16
- Reducing Streaming ASR Model Delay With Self Alignment (2021) 6.77
- Streaming Chunk-aware Multihead Attention For Online End-to-end Speech Recognition (2020) 8.60
- Minimum Latency Training Strategies For Streaming Sequence-to-sequence ASR (2020) 10.07
- Streaming End-to-end Speech Recognition With Jointly Trained Neural Feature Enhancement (2021) 5.84
- Chunked Attention-based Encoder-decoder Model For Streaming Speech Recognition (2023) 7.81
- Extreme Encoder Output Frame Rate Reduction: Improving Computational Latencies Of Large End-to-end Models (2024) 5.84