Minimum Latency Training Strategies For Streaming Sequence-to-sequence ASR
2020 · Hirofumi Inaguma, Yashesh Gaur, Liang Lu, et al.
Abstract
Recently, a few novel streaming attention-based sequence-to-sequence (S2S) models have been proposed to perform online speech recognition with linear-time decoding complexity. However, in these models, the decisions to generate tokens are delayed compared to the actual acoustic boundaries since their unidirectional encoders lack future information. This leads to an inevitable latency during inference. To alleviate this issue and reduce latency, we propose several training strategies that leverage external hard alignments extracted from the hybrid model. We investigate using the alignments in both the encoder and the decoder. On the encoder side, (1) multi-task learning and (2) pre-training with the framewise classification task are studied. On the decoder side, we (3) remove inappropriate alignment paths beyond an acceptable latency during the alignment marginalization, and (4) directly minimize the differentiable expected latency loss. Experiments on the Cortana voice sear
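The two decoder-side ideas named in the abstract, (3) and (4), can be pictured with a small toy sketch. It is not the authors' implementation: it assumes the model exposes, for each output token, a probability distribution over emission frames (called `alpha` here) and that the hybrid-model alignment gives one reference boundary frame per token. The function names, the absolute-difference loss form, and the `max_delay` parameter are all illustrative assumptions.

```python
def mask_late_paths(alpha, ref_boundaries, max_delay):
    """Zero out emission probability on frames later than ref + max_delay.

    alpha[i][j]       : assumed probability that token i is emitted at frame j
    ref_boundaries[i] : assumed reference boundary frame of token i
                        (from an external hard alignment)
    max_delay         : hypothetical acceptable latency, in frames
    """
    return [
        [p if j <= ref + max_delay else 0.0 for j, p in enumerate(probs)]
        for probs, ref in zip(alpha, ref_boundaries)
    ]


def expected_latency_loss(alpha, ref_boundaries):
    """Mean absolute gap between expected emission frame and reference boundary.

    The expected frame is a probability-weighted sum of frame indices, so the
    quantity is a smooth function of alpha (illustrated here with plain floats).
    """
    total = 0.0
    for probs, ref in zip(alpha, ref_boundaries):
        norm = sum(probs) or 1.0  # guard against an all-zero row
        expected_frame = sum(j * p for j, p in enumerate(probs)) / norm
        total += abs(expected_frame - ref)
    return total / len(ref_boundaries)


# Toy example: two tokens, four frames.
alpha = [
    [0.0, 0.5, 0.5, 0.0],  # token 0 is split between frames 1 and 2
    [0.0, 0.0, 0.0, 1.0],  # token 1 is emitted at frame 3
]
refs = [1, 3]

loss = expected_latency_loss(alpha, refs)
masked = mask_late_paths(alpha, refs, max_delay=0)
```

In this toy case token 0 has expected emission frame 1.5 against a reference of 1, and token 1 matches its reference exactly, so the mean latency is 0.25 frames. Masking with `max_delay=0` removes the probability mass that token 0 places on frame 2.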
Related papers
- High Performance Sequence-to-sequence Model For Streaming Speech Recognition (2020)
- Minimum Latency Training Of Sequence Transducers For Streaming End-to-end Speech Recognition (2022)
- Alignment Knowledge Distillation For Online Streaming Attention-based Speech Recognition (2021)
- Mask-CTC-based Encoder Pre-training For Streaming End-to-end Speech Recognition (2023)
- Streaming Parallel Transducer Beam Search With Fast-slow Cascaded Encoders (2022)
- Reducing Streaming ASR Model Delay With Self Alignment (2021)
- Streaming Audio-visual Speech Recognition With Alignment Regularization (2022)
- Latency-controlled Neural Architecture Search For Streaming Speech Recognition (2021)