Integrating Text Inputs For Training And Adapting RNN Transducer ASR Models
2022 · Samuel Thomas, Brian Kingsbury, George Saon, et al.
Abstract
Compared to hybrid automatic speech recognition (ASR) systems that use a modular architecture in which each component can be independently adapted to a new domain, recent end-to-end (E2E) ASR systems are harder to customize due to their all-neural monolithic construction. In this paper, we propose a novel text representation and training framework for E2E ASR models. With this approach, we show that a trained RNN Transducer (RNN-T) model's internal LM component can be effectively adapted with text-only data. An RNN-T model trained using both speech and text inputs improves over a baseline model trained on just speech with close to 13% word error rate (WER) reduction on the Switchboard and CallHome test sets of the NIST Hub5 2000 evaluation. The usefulness of the proposed approach is further demonstrated by customizing this general purpose RNN-T model to three separate datasets. We observe 20-45% relative WER reduction in these settings with this novel LM style customiz
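The abstract reports its gains as WER reductions, some of them relative. As a minimal sketch of how such figures are typically computed (not the authors' scoring pipeline), the Python below derives WER from a word-level edit distance and then a relative reduction between two systems. The function names and the example sentences and rates are hypothetical illustrations, not data from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference word
                curr[j - 1] + 1,         # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution (cost 0 on a match)
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline error removed by the new system."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical example: one substitution and one deletion over six words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")

# Hypothetical rates: going from 20% to 15% WER is a 25% relative reduction.
gain = relative_reduction(0.20, 0.15)
```

A relative reduction is measured against the baseline error rate, so it reads larger than the absolute drop in WER points: in the hypothetical case above, a 5-point absolute drop corresponds to a 25% relative reduction.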
Related papers
- Improving RNN Transducer Modeling For End-to-end Speech Recognition (2019)
- Improved Neural Language Model Fusion For Streaming Recurrent Neural Network Transducer (2020)
- Improving RNN Transducer Based ASR With Auxiliary Tasks (2020)
- Developing RNN-T Models Surpassing High-performance Hybrid Models With Customization Capability (2020)
- Exploring Architectures, Data And Units For Streaming End-to-end Speech Recognition With RNN-Transducer (2018)
- Alignment Restricted Streaming Recurrent Neural Network Transducer (2020)
- Integrating Pre-trained Speech And Language Models For End-to-end Speech Recognition (2023)
- Multiple-hypothesis RNN-T Loss For Unsupervised Fine-tuning And Self-training Of Neural Transducer (2022)