Reducing Streaming ASR Model Delay With Self Alignment
2021 · Jaeyoung Kim, Han Lu, Anshuman Tripathi, et al.
Abstract
Reducing prediction delay for streaming end-to-end ASR models with minimal performance regression is a challenging problem. Constrained alignment is a well-known existing approach that penalizes predicted word boundaries using external low-latency acoustic models. In contrast, the recently proposed FastEmit is a sequence-level delay regularization scheme that encourages vocabulary tokens over blanks without any reference alignments. Although both schemes succeed in reducing delay, ASR word error rate (WER) often degrades severely once they are applied. In this paper, we propose a novel delay constraining method, named self alignment. Self alignment does not require external alignment models. Instead, it uses Viterbi forced alignments from the trained model to find the lower-latency alignment direction. In LibriSpeech evaluation, self alignment outperformed existing schemes: 25% and 56% less delay compared to FastEmit and constrained alignment at
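The mechanism described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes a transducer-style lattice of log-probabilities, and it assumes that the "lower latency alignment direction" is obtained by moving each label emission one frame earlier than the Viterbi forced alignment. The function names (`viterbi_emit_frames`, `self_alignment_frames`, `self_alignment_penalty`) and the toy lattice are hypothetical.

```python
import numpy as np

def viterbi_emit_frames(blank, label):
    """Best-path (Viterbi) forced alignment over a transducer lattice.

    blank: (T, U+1) array, log P(blank) at frame t after u labels were emitted.
    label: (T, U) array, log P(emitting label u) at frame t.
    Returns a list with the emission frame of each of the U labels.
    """
    T, U = label.shape
    score = np.full((T, U + 1), -np.inf)
    from_label = np.zeros((T, U + 1), dtype=bool)
    score[0, 0] = 0.0
    for t in range(T):
        for u in range(U + 1):
            if t == 0 and u == 0:
                continue
            # Arrive by consuming a frame (blank) or by emitting a label.
            stay = score[t - 1, u] + blank[t - 1, u] if t > 0 else -np.inf
            emit = score[t, u - 1] + label[t, u - 1] if u > 0 else -np.inf
            from_label[t, u] = emit > stay
            score[t, u] = max(stay, emit)
    # Backtrace from the final lattice node (T-1, U).
    frames = [0] * U
    t, u = T - 1, U
    while u > 0:
        if from_label[t, u]:
            u -= 1
            frames[u] = t
        else:
            t -= 1
    return frames

def self_alignment_frames(frames):
    # Assumption: the lower-latency target sits one frame to the left.
    return [max(t - 1, 0) for t in frames]

def self_alignment_penalty(label, frames):
    # Negative log-probability of emitting each label one frame earlier.
    shifted = self_alignment_frames(frames)
    return -sum(label[t, u] for u, t in enumerate(shifted))

# Toy lattice: 4 frames, 2 labels; the model prefers frames 2 and 3.
p_label = np.array([[0.1, 0.1], [0.1, 0.1], [0.9, 0.1], [0.1, 0.9]])
label = np.log(p_label)
blank = np.log(np.hstack([1.0 - p_label, np.ones((4, 1))]))

frames = viterbi_emit_frames(blank, label)   # -> [2, 3]
shifted = self_alignment_frames(frames)      # -> [1, 2]
penalty = self_alignment_penalty(label, frames)
```

In a training loop, a term like `penalty` would presumably be scaled by a tunable weight and added to the usual transducer loss, so that the model is nudged toward earlier emissions without any external alignment model.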
Related papers
- FastEmit: Low-latency Streaming ASR With Sequence-level Emission Regularization (2020) · 12.61
- Minimum Latency Training Strategies For Streaming Sequence-to-sequence ASR (2020) · 10.07
- StableEmit: Selection Probability Discount For Reducing Emission Latency Of Streaming Monotonic Attention ASR (2021) · 3.58
- Streaming Audio-visual Speech Recognition With Alignment Regularization (2022) · 3.58
- Improving Streaming Transformer Based ASR Under A Framework Of Self-supervised Learning (2021) · 8.09
- Streaming Parallel Transducer Beam Search With Fast-slow Cascaded Encoders (2022) · 0.00
- Iterative Pseudo-forced Alignment By Acoustic CTC Loss For Self-supervised ASR Domain Adaptation (2022) · 0.00
- Alignment Knowledge Distillation For Online Streaming Attention-based Speech Recognition (2021) · 7.16