Three-module Modeling For End-to-end Spoken Language Understanding Using Pre-trained DNN-HMM-based Acoustic-phonetic Model
2022 · Nick J. C. Wang, Lu Wang, Yandan Sun, et al.
Abstract
In spoken language understanding (SLU), what the user says is converted to his/her intent. Recent work on end-to-end SLU has shown that accuracy can be improved via pre-training approaches. We revisit ideas presented by Lugosch et al. using speech pre-training and three-module modeling; however, to ease construction of the end-to-end SLU model, we use as our phoneme module an open-source acoustic-phonetic model from a DNN-HMM hybrid automatic speech recognition (ASR) system instead of training one from scratch. Hence we fine-tune on speech only for the word module, and we apply multi-target learning (MTL) on the word and intent modules to jointly optimize SLU performance. MTL yields a relative reduction of 40% in intent-classification error rates (from 1.0% to 0.6%). Note that our three-module model is a streaming method. The final outcome of the proposed three-module modeling approach yields an intent accuracy of 99.4% on FluentSpeech, an intent error rate reduction of 50% compared to
Related papers
- Integrating Pretrained ASR And LM To Perform Sequence Generation For Spoken Language Understanding (2023)
- Speech-language Pre-training For End-to-end Spoken Language Understanding (2021)
- End-to-end Architectures For ASR-free Spoken Language Understanding (2019)
- Modality Confidence Aware Training For Robust End-to-end Spoken Language Understanding (2023)
- ST-BERT: Cross-modal Language Model Pre-training For End-to-end Spoken Language Understanding (2020)
- Pre-training For Spoken Language Understanding With Joint Textual And Phonetic Representation Learning (2021)
- Understanding Semantics From Speech Through Pre-training (2019)
- Recent Advances In End-to-end Spoken Language Understanding (2019)