Multiple-hypothesis RNN-T Loss For Unsupervised Fine-tuning And Self-training Of Neural Transducer
2022 · Cong-Thanh Do, Mohan Li, Rama Doddipatla
Abstract
This paper proposes a new approach to unsupervised fine-tuning and self-training with unlabeled speech data for recurrent neural network (RNN)-Transducer (RNN-T) end-to-end (E2E) automatic speech recognition (ASR) systems. Conventional systems perform fine-tuning/self-training on unlabeled audio data using ASR hypotheses as the targets, and are therefore sensitive to the ASR performance of the base model. To alleviate the influence of ASR errors when using unlabeled data, we propose a multiple-hypothesis RNN-T loss that incorporates multiple ASR 1-best hypotheses into the loss function. For the fine-tuning task, ASR experiments on LibriSpeech show that the multiple-hypothesis approach achieves a 14.2% relative word error rate (WER) reduction compared to the single-hypothesis approach on the test_other set. For the self-training task, ASR models are trained using supervised data from Wall Street Journal (WSJ), Aurora-4 along with CHiME-4 real noisy data a
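The abstract does not give the loss formula, so the following is only a minimal sketch of one plausible reading: compute a loss per hypothesis, then combine them by a weighted average. The averaging rule, the weights, and the function names `combine_hypothesis_losses` and `relative_wer_reduction` are all assumptions made for illustration, not the authors' formulation. The second helper shows how a relative WER reduction such as the reported 14.2% is computed from two absolute WERs.

```python
from typing import Optional, Sequence


def combine_hypothesis_losses(
    losses: Sequence[float],
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Weighted average of per-hypothesis losses.

    ASSUMPTION: averaging is an illustrative combination rule; the
    abstract only says that multiple hypotheses enter the loss.
    """
    if not losses:
        raise ValueError("need at least one per-hypothesis loss")
    if weights is None:
        # Treat every hypothesis as equally reliable.
        weights = [1.0] * len(losses)
    if len(weights) != len(losses):
        raise ValueError("weights and losses must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(w * l for w, l in zip(weights, losses))
    return weighted_sum / total_weight


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction in percent, measured against the baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer
```

In a real training loop the per-hypothesis values would come from an RNN-T loss evaluated against each hypothesis transcript; here they are plain floats so the combination step can be read in isolation.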
Related papers
- Multiple-hypothesis CTC-based Semi-supervised Adaptation Of End-to-end Speech Recognition (2021)
- Integrating Text Inputs For Training And Adapting RNN Transducer ASR Models (2022)
- Alignment Restricted Streaming Recurrent Neural Network Transducer (2020)
- Minimum Bayes Risk Training Of RNN-Transducer For End-to-end Speech Recognition (2019)
- Multitask Learning And Joint Optimization For Transformer-RNN-Transducer Speech Recognition (2020)
- Streaming Multi-speaker ASR With RNN-T (2020)
- Improved Neural Language Model Fusion For Streaming Recurrent Neural Network Transducer (2020)
- Improving RNN Transducer Modeling For End-to-end Speech Recognition (2019)