Alignment Restricted Streaming Recurrent Neural Network Transducer
2020 · Jay Mahadeokar, Yuan Shangguan, Duc Le, et al.
Abstract
There is growing interest in the speech community in developing Recurrent Neural Network Transducer (RNN-T) models for automatic speech recognition (ASR) applications. RNN-T is trained with a loss function that does not enforce temporal alignment between the training transcripts and the audio. As a result, RNN-T models built with uni-directional long short-term memory (LSTM) encoders tend to wait for longer spans of input audio before streaming already decoded ASR tokens. In this work, we propose a modification to the RNN-T loss function and develop Alignment Restricted RNN-T (Ar-RNN-T) models, which use audio-text alignment information to guide the loss computation. We compare the proposed method with existing works, such as monotonic RNN-T, on LibriSpeech and in-house datasets. We show that the Ar-RNN-T loss provides refined control over the trade-off between token emission delay and Word Error Rate (WER). The Ar-RNN-T models also improve downstream applications such
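The idea of letting alignment information guide the loss can be illustrated with a small sketch. The code below is not the authors' implementation: it is a plain NumPy version of the standard RNN-T forward recursion, with one assumed restriction added, namely that label u may only be emitted within a window of frames around its reference alignment frame. The names align_frames, left_buf and right_buf are hypothetical, chosen for this illustration; the abstract does not specify how the restriction is parameterized.

```python
import numpy as np

def rnnt_nll(log_probs, labels, blank=0,
             align_frames=None, left_buf=0, right_buf=0):
    """Negative log-likelihood from the RNN-T forward recursion.

    log_probs: array of shape (T, U + 1, V) with log-softmax joint outputs.
    labels: sequence of U target token ids.
    align_frames (assumed input): reference frame index for each label.
    If given, label u may only be emitted at frame t with
    align_frames[u] - left_buf <= t <= align_frames[u] + right_buf.
    """
    T, U1, _ = log_probs.shape
    U = len(labels)
    assert U1 == U + 1, "second axis must have length len(labels) + 1"

    alpha = np.full((T, U + 1), -np.inf)
    alpha[0, 0] = 0.0
    for t in range(T):
        for u in range(U + 1):
            if t == 0 and u == 0:
                continue
            # Blank transition: advance one frame, keep the same label count.
            stay = -np.inf
            if t > 0:
                stay = alpha[t - 1, u] + log_probs[t - 1, u, blank]
            # Label transition: emit labels[u - 1] at frame t.
            emit = -np.inf
            if u > 0:
                allowed = True
                if align_frames is not None:
                    a = align_frames[u - 1]
                    allowed = (a - left_buf) <= t <= (a + right_buf)
                if allowed:
                    emit = alpha[t, u - 1] + log_probs[t, u - 1, labels[u - 1]]
            alpha[t, u] = np.logaddexp(stay, emit)

    # Every complete path ends with a final blank at the last frame.
    return -(alpha[T - 1, U] + log_probs[T - 1, U, blank])

# Toy input: 4 frames, 2 labels, 3 output symbols, uniform probabilities.
toy = np.full((4, 3, 3), -np.log(3.0))
unrestricted = rnnt_nll(toy, [1, 2])
restricted = rnnt_nll(toy, [1, 2], align_frames=[1, 2])
```

With wide buffers the restricted value matches the unrestricted one; with narrow buffers fewer alignment paths contribute, so paths that emit tokens far from their reference frames no longer receive probability mass. In this sketch the buffer widths are the knobs that would trade emission delay against accuracy.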
Related papers
- Improved Neural Language Model Fusion For Streaming Recurrent Neural Network Transducer (2020)
- RNN-T For Latency Controlled ASR With Improved Beam Search (2019)
- Streaming Multi-speaker ASR With RNN-T (2020)
- An Investigation Of Monotonic Transducers For Large-scale Automatic Speech Recognition (2022)
- Integrating Text Inputs For Training And Adapting RNN Transducer ASR Models (2022)
- Exploring Pre-training With Alignments For RNN Transducer Based End-to-end Speech Recognition (2020)
- CIF-T: A Novel CIF-based Transducer Architecture For Automatic Speech Recognition (2023)
- Exploring Architectures, Data And Units For Streaming End-to-end Speech Recognition With RNN-Transducer (2018)