RNN-T Models Fail To Generalize To Out-of-domain Audio: Causes And Solutions
2020 · Chung-Cheng Chiu, Arun Narayanan, Wei Han, et al.
Abstract
In recent years, all-neural end-to-end approaches have obtained state-of-the-art results on several challenging automatic speech recognition (ASR) tasks. However, most existing work focuses on building ASR models where training and test data are drawn from the same domain. This results in poor generalization to mismatched domains; for example, end-to-end models trained on short segments perform poorly when evaluated on longer utterances. In this work, we analyze the generalization properties of streaming and non-streaming recurrent neural network transducer (RNN-T) based end-to-end models in order to identify model components that negatively affect generalization performance. We propose two solutions: combining multiple regularization techniques during training, and using dynamic overlapping inference. On a long-form YouTube test set, when the non-streaming RNN-T model is trained with shorter segments of data, the proposed combination improves word error rate (WER) from 22.3% to 14
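The abstract reports its results as word error rate (WER). As a reminder of what that metric measures, here is a minimal sketch of the standard definition: the word-level edit distance (substitutions, deletions, insertions) between a hypothesis and a reference transcript, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; the paper's own scoring pipeline and text normalization are not described in this excerpt.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative only: tokenizes on whitespace and applies no text
    normalization, which real ASR scoring tools usually do.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One deleted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference, which is one way failures on long-form audio can show up in this metric.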
Related papers
- Recognizing Long-form Speech Using Streaming End-to-end Models (2019)
- Exploring Architectures, Data And Units For Streaming End-to-end Speech Recognition With RNN-transducer (2018)
- Improving RNN Transducer Modeling For End-to-end Speech Recognition (2019)
- Improved Neural Language Model Fusion For Streaming Recurrent Neural Network Transducer (2020)
- Developing RNN-T Models Surpassing High-performance Hybrid Models With Customization Capability (2020)
- Improving RNN Transducer Based ASR With Auxiliary Tasks (2020)
- Alignment Restricted Streaming Recurrent Neural Network Transducer (2020)
- Generalizing RNN-transducer To Out-domain Audio Via Sparse Self-attention Layers (2021)