Independent Language Modeling Architecture For End-to-end ASR
2019 · Van Tung Pham, Haihua Xu, Yerbolat Khassanov, et al.
Abstract
The attention-based end-to-end (E2E) automatic speech recognition (ASR) architecture allows for joint optimization of acoustic and language models within a single network. However, in a vanilla E2E ASR architecture, the decoder sub-network (subnet), which incorporates the role of the language model (LM), is conditioned on the encoder output. As a result, the acoustic encoder and the language model are entangled, which prevents the language model from being trained separately on external text data. To address this problem, we propose a new architecture that separates the decoder subnet from the encoder output. In this way, the decoupled subnet becomes an independently trainable LM subnet, which can easily be updated using external text data. We study two strategies for updating the new architecture. Experimental results show that 1) the independent LM architecture benefits from external text data, achieving 9.3% and 22.8% relative character and word error rate reductio
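The separation described in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: the layer sizes, the random weights, the simple tanh recurrence, and the choice to fuse the acoustic context by adding logits after the LM subnet are all assumptions made for illustration. The only point it demonstrates is that the LM subnet's state depends on the previous tokens alone, so it could be trained on text without any encoder output.

```python
import math
import random

random.seed(0)
VOCAB, HID, ENC = 5, 4, 3  # hypothetical toy sizes, not from the paper


def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def softmax(z):
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]


EMBED = rand_matrix(VOCAB, HID)      # token embeddings
W_REC = rand_matrix(HID, HID)        # recurrent weights of the LM subnet
W_LM_OUT = rand_matrix(VOCAB, HID)   # projects the LM state to vocabulary logits
W_CTX_OUT = rand_matrix(VOCAB, ENC)  # projects the acoustic context to vocabulary logits


def lm_step(prev_token, state):
    """LM subnet step: uses only the previous token and its own state."""
    pre = [e + r for e, r in zip(EMBED[prev_token], matvec(W_REC, state))]
    return [math.tanh(x) for x in pre]


def lm_only_probs(state):
    """Next-token distribution from text alone (no encoder input needed)."""
    return softmax(matvec(W_LM_OUT, state))


def asr_probs(state, context):
    """Assumed fusion: add acoustic-context logits to the LM logits afterwards."""
    lm_logits = matvec(W_LM_OUT, state)
    ctx_logits = matvec(W_CTX_OUT, context)
    return softmax([a + b for a, b in zip(lm_logits, ctx_logits)])


# Run the LM subnet over a short token history; no encoder output is involved.
tokens = [1, 3, 2]
state = [0.0] * HID
for t in tokens:
    state = lm_step(t, state)

# Two made-up acoustic context vectors standing in for attention over encoder output.
ctx_a = [0.2, -0.1, 0.4]
ctx_b = [-0.3, 0.5, 0.1]

print("LM-only:", lm_only_probs(state))
print("ASR (ctx_a):", asr_probs(state, ctx_a))
print("ASR (ctx_b):", asr_probs(state, ctx_b))
```

Because `lm_step` never receives the context vector, the same `state` serves both text-only prediction and ASR decoding. In a conventional attention decoder the context would instead feed into the recurrent update, so the state could not be computed from text alone.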
Related papers
- Internal Language Model Training For Domain-adaptive End-to-end Speech Recognition (2021)
- Integrating Pre-trained Speech And Language Models For End-to-end Speech Recognition (2023)
- Adaptable End-to-end ASR Models Using Replaceable Internal LMs And Residual Softmax (2023)
- Internal Language Model Estimation For Domain-adaptive End-to-end Speech Recognition (2020)
- Audio-attention Discriminative Language Model For ASR Rescoring (2019)
- Optimizing Alignment Of Speech And Language Latent Spaces For End-to-end Speech Recognition And Understanding (2021)
- Internal Language Model Estimation Through Explicit Context Vector Learning For Attention-based Encoder-decoder ASR (2022)
- Investigating Methods To Improve Language Model Integration For Attention-based Encoder-decoder ASR Models (2021)