Deep Shallow Fusion For RNN-T Personalization
2020 · Duc Le, Gil Keren, Julian Chan, et al.
Abstract
End-to-end models in general, and Recurrent Neural Network Transducer (RNN-T) in particular, have gained significant traction in the automatic speech recognition community in the last few years due to their simplicity, compactness, and excellent performance on generic transcription tasks. However, these models are more challenging to personalize compared to traditional hybrid systems due to the lack of external language models and difficulties in recognizing rare long-tail words, specifically entity names. In this work, we present novel techniques to improve RNN-T's ability to model rare WordPieces, infuse extra information into the encoder, enable the use of alternative graphemic pronunciations, and perform deep fusion with personalized language models for more robust biasing. We show that these combined techniques result in 15.4%-34.5% relative Word Error Rate improvement compared to a strong RNN-T baseline which uses shallow fusion and text-to-speech augmentation. Our work helps pus
Related papers
- Contextual Adapters For Personalized Speech Recognition In Neural Transducers (2022)
- Improved Neural Language Model Fusion For Streaming Recurrent Neural Network Transducer (2020)
- Improving RNN Transducer Modeling For End-to-end Speech Recognition (2019)
- Personalization Of CTC-based End-to-end Speech Recognition Using Pronunciation-driven Subword Tokenization (2023)
- Integrating Text Inputs For Training And Adapting RNN Transducer ASR Models (2022)
- Developing RNN-T Models Surpassing High-performance Hybrid Models With Customization Capability (2020)
- PROCTER: Pronunciation-aware Contextual Adapter For Personalized Speech Recognition In Neural Transducers (2023)
- Contextualized Streaming End-to-end Speech Recognition With Trie-based Deep Biasing And Shallow Fusion (2021)