Exploring Pre-training With Alignments For RNN Transducer Based End-to-end Speech Recognition
2020 · Hu Hu, Rui Zhao, Jinyu Li, et al.
Abstract
Recently, the recurrent neural network transducer (RNN-T) architecture has become an emerging trend in end-to-end automatic speech recognition research because it supports online streaming speech recognition. However, RNN-T training is made difficult by its huge memory requirements and complicated neural structure. A common solution to ease RNN-T training is to employ a connectionist temporal classification (CTC) model along with an RNN language model (RNNLM) to initialize the RNN-T parameters. In this work, we instead leverage external alignments to seed the RNN-T model. Two different pre-training solutions are explored, referred to as encoder pre-training and whole-network pre-training, respectively. Evaluated on 65,000 hours of anonymized Microsoft production data with personally identifiable information removed, our proposed methods obtain significant improvement. In particular, the encoder pre-training solution achieved a 10% and an 8% relative word error
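As a rough illustration of what seeding an encoder with external alignments could look like, the sketch below computes a frame-level cross-entropy between per-frame encoder outputs and alignment labels. The abstract does not specify the loss, the label set, or how blanks are handled, so the cross-entropy objective, the function names `log_softmax` and `frame_cross_entropy`, and the toy inputs are all assumptions made for illustration, not the authors' method.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one frame's logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def frame_cross_entropy(frame_logits, alignment):
    """Mean cross-entropy between per-frame logits and alignment labels.

    frame_logits: list of T lists, each holding V unnormalized scores
                  (hypothetical encoder outputs, one list per frame).
    alignment:    list of T integer label ids, one per frame
                  (hypothetical external alignment).
    """
    if len(frame_logits) != len(alignment):
        raise ValueError("need exactly one alignment label per frame")
    total = 0.0
    for logits, label in zip(frame_logits, alignment):
        # Negative log-probability of the aligned label at this frame.
        total -= log_softmax(logits)[label]
    return total / len(alignment)


# Toy usage: two frames, three labels, uniform scores.
uniform_loss = frame_cross_entropy([[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]], [0, 2])
# A frame whose score already favors the aligned label gives a lower loss.
confident_loss = frame_cross_entropy([[5.0, 0.0, 0.0]], [0])
```

In a real system the logits would come from the RNN-T encoder followed by a projection layer, and the pre-trained encoder weights would then initialize RNN-T training; both steps are outside this sketch.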
Related papers
- Improving RNN Transducer Modeling For End-to-end Speech Recognition (2019)
- Exploring Architectures, Data And Units For Streaming End-to-end Speech Recognition With RNN-Transducer (2018)
- HMM-free Encoder Pre-training For Streaming RNN Transducer (2021)
- Exploring Neural Transducers For End-to-end Speech Recognition (2017)
- Self-attention Transducers For End-to-end Speech Recognition (2019)
- Alignment Restricted Streaming Recurrent Neural Network Transducer (2020)
- Efficient Minimum Word Error Rate Training Of RNN-Transducer For End-to-end Speech Recognition (2020)
- Exploring RNN-Transducer For Chinese Speech Recognition (2018)