A Density Ratio Approach To Language Model Fusion In End-to-end Automatic Speech Recognition
2020 · Erik McDermott, Hasim Sak, Ehsan Variani
Abstract
This article describes a density ratio approach to integrating external Language Models (LMs) into end-to-end models for Automatic Speech Recognition (ASR). Applied to a Recurrent Neural Network Transducer (RNN-T) ASR model trained on a given domain, a matched in-domain RNN-LM, and a target domain RNN-LM, the proposed method uses Bayes' Rule to define RNN-T posteriors for the target domain, in a manner directly analogous to the classic hybrid model for ASR based on Deep Neural Networks (DNNs) or LSTMs in the Hidden Markov Model (HMM) framework (Bourlard & Morgan, 1994). The proposed approach is evaluated in cross-domain and limited-data scenarios, for which a significant amount of target domain text data is used for LM training, but only limited (or no) {audio, transcript} training data pairs are used to train the RNN-T. Specifically, an RNN-T model trained on paired audio & transcript data from YouTube is evaluated for its ability to generalize to Voice Search data. The Density Rati
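The Bayes' Rule step named in the abstract can be illustrated with a minimal sketch. It assumes the usual log-domain reading of a density ratio: the source-domain RNN-T posterior is divided by the source-domain LM prior to give a pseudo-likelihood, which is then multiplied by the target-domain LM prior. The function names, the two scaling weights, and the toy log-probabilities are all hypothetical and chosen only for illustration; the abstract itself specifies no weights or values.

```python
def density_ratio_score(log_p_rnnt, log_p_source_lm, log_p_target_lm,
                        source_weight=1.0, target_weight=1.0):
    """Log-domain density ratio score for one hypothesis.

    log_p_rnnt:       log P(y | x) from the source-domain RNN-T
    log_p_source_lm:  log P(y) from the LM matched to the RNN-T training domain
    log_p_target_lm:  log P(y) from the target-domain LM
    The weights are an assumption of this sketch, not taken from the abstract.
    """
    return (log_p_rnnt
            - source_weight * log_p_source_lm
            + target_weight * log_p_target_lm)


def best_hypothesis(hypotheses, **weights):
    """Return the text of the highest-scoring hypothesis.

    Each hypothesis is a tuple:
    (text, log_p_rnnt, log_p_source_lm, log_p_target_lm).
    """
    return max(
        hypotheses,
        key=lambda h: density_ratio_score(h[1], h[2], h[3], **weights),
    )[0]


# Toy n-best list with made-up log-probabilities.
n_best = [
    ("play some music", -4.0, -6.0, -3.0),
    ("play sum music", -3.5, -5.0, -9.0),
]
print(best_hypothesis(n_best))
```

In this toy list the RNN-T score alone prefers the second hypothesis, while the density ratio score prefers the first because the target-domain LM assigns it a much higher prior.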
Related papers
- An Empirical Study Of Language Model Integration For Transducer Based Speech Recognition (2022)
- Improved Neural Language Model Fusion For Streaming Recurrent Neural Network Transducer (2020)
- On Language Model Integration For RNN Transducer Based Speech Recognition (2021)
- Internal Language Model Estimation For Domain-adaptive End-to-end Speech Recognition (2020)
- Multilingual And Fully Non-autoregressive ASR With Large Language Model Fusion: A Comprehensive Study (2024)
- Delayed Fusion: Integrating Large Language Models Into First-pass Decoding In End-to-end Speech Recognition (2025)
- Integrating Pre-trained Speech And Language Models For End-to-end Speech Recognition (2023)
- Transducer-Llama: Integrating LLMs Into Streamable Transducer-based Speech Recognition (2024)