Improving RNN Transducer Modeling For End-to-end Speech Recognition
2019 · Jinyu Li, Rui Zhao, Hu Hu, et al.
Abstract
In the last few years, the study of end-to-end (E2E) systems has become an emerging trend in automatic speech recognition research. Connectionist Temporal Classification (CTC), Attention Encoder-Decoder (AED), and RNN Transducer (RNN-T) are the three most popular methods. Among these three methods, RNN-T has the advantage of supporting online streaming, which is challenging for AED, and it does not make CTC's frame-independence assumption. In this paper, we improve RNN-T training in two aspects. First, we optimize the training algorithm of RNN-T to reduce memory consumption so that we can use larger training minibatches for faster training. Second, we propose better model structures so that we obtain RNN-T models with very good accuracy but a small footprint. Trained with 30 thousand hours of anonymized and transcribed Microsoft production data, the best RNN-T model, with an even smaller model size (216 megabytes), achieves up to 11.8% relative word error rate (WER) reduction from the baseline RNN
Related papers
- Exploring Architectures, Data And Units For Streaming End-to-end Speech Recognition With RNN-Transducer (2018)
- On The Comparison Of Popular End-to-end Models For Large Scale Speech Recognition (2020)
- Exploring Pre-training With Alignments For RNN Transducer Based End-to-end Speech Recognition (2020)
- Developing RNN-T Models Surpassing High-performance Hybrid Models With Customization Capability (2020)
- Exploring Neural Transducers For End-to-end Speech Recognition (2017)
- Efficient Minimum Word Error Rate Training Of RNN-Transducer For End-to-end Speech Recognition (2020)
- Exploring RNN-Transducer For Chinese Speech Recognition (2018)
- Integrating Text Inputs For Training And Adapting RNN Transducer ASR Models (2022)