On Modular Training Of Neural Acoustics-to-word Model For LVCSR
2018 · Zhehuai Chen, Qi Liu, Hao Li, et al.
Abstract
End-to-end (E2E) automatic speech recognition (ASR) systems directly map acoustics to words using a unified model. Previous work mostly focuses on E2E training of a single model that integrates the acoustic and language models into a whole. Although E2E training benefits from sequence modeling and simplified decoding pipelines, a large amount of transcribed acoustic data is usually required, and traditional acoustic and language modelling techniques cannot be utilized. In this paper, a novel modular training framework for E2E ASR is proposed that trains the neural acoustic and language models separately during the training stage, while still performing end-to-end inference in the decoding stage. Here, an acoustics-to-phoneme model (A2P) and a phoneme-to-word model (P2W) are trained using acoustic data and text data respectively. A phone synchronous decoding (PSD) module is inserted between A2P and P2W to reduce sequence lengths without precision loss. Finally, modules are integrated into an acoustics-to-word mod
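The role of the PSD module between A2P and P2W can be illustrated with a small sketch. The abstract does not say how PSD shortens the sequence, so the code below assumes one common reading: the A2P model emits per-frame posteriors with a CTC-style blank symbol, and frames dominated by blank are dropped before they reach P2W. The function name `psd_filter`, the blank index, the 0.9 threshold, and the toy posteriors are all hypothetical and chosen for illustration only.

```python
# Toy per-frame posteriors from a hypothetical A2P model.
# Column 0 is assumed to be the CTC blank; columns 1 and 2 are two phonemes.
frames = [
    [0.98, 0.01, 0.01],
    [0.10, 0.85, 0.05],
    [0.95, 0.03, 0.02],
    [0.97, 0.02, 0.01],
    [0.05, 0.05, 0.90],
    [0.99, 0.005, 0.005],
]


def psd_filter(posteriors, blank_index=0, blank_threshold=0.9):
    """Keep only frames whose blank posterior is below the threshold.

    This is an assumed simplification of phone synchronous decoding:
    blank-dominated frames carry little phoneme information, so they are
    removed to shorten the sequence passed to the next module.
    """
    kept = []
    for frame in posteriors:
        if frame[blank_index] < blank_threshold:
            kept.append(frame)
    return kept


shortened = psd_filter(frames)
print(len(frames), "frames in,", len(shortened), "frames out")
```

In this toy input only two of the six frames survive, which is the kind of length reduction that would make a downstream P2W model cheaper to run. The threshold is a tuning knob in this sketch, not a value taken from the paper.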
Related papers
- Integrating Pre-trained Speech And Language Models For End-to-end Speech Recognition (2023)
- Improving Non-autoregressive End-to-end Speech Recognition With Pre-trained Acoustic And Language Models (2022)
- Improving OOV Detection And Resolution With External Language Models In Acoustic-to-word ASR (2019)
- Phoneme Based Neural Transducer For Large Vocabulary Speech Recognition (2020)
- A Comparative Study Of Modular And Joint Approaches For Speaker-attributed ASR On Monaural Long-form Audio (2021)
- Independent Language Modeling Architecture For End-to-end ASR (2019)
- Acoustic Data-driven Subword Modeling For End-to-end Speech Recognition (2021)
- Optimizing Alignment Of Speech And Language Latent Spaces For End-to-end Speech Recognition And Understanding (2021)