Pre-training For Spoken Language Understanding With Joint Textual And Phonetic Representation Learning
2021 · Qian Chen, Wen Wang, Qinglin Zhang
Abstract
In the traditional cascading architecture for spoken language understanding (SLU), it has been observed that automatic speech recognition (ASR) errors could be detrimental to the performance of natural language understanding. End-to-end (E2E) SLU models have been proposed to directly map speech input to the desired semantic frame with a single model, hence mitigating ASR error propagation. Recently, pre-training technologies have been explored for these E2E models. In this paper, we propose a novel joint textual-phonetic pre-training approach for learning spoken language representations, aiming to exploit the full potential of phonetic information to improve SLU robustness to ASR errors. We explore phoneme labels as high-level speech features, and design and compare pre-training tasks based on conditional masked language model objectives and inter-sentence relation objectives. We also investigate the efficacy of combining textual and phonetic information during fine-tuning. Experimental resul
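The abstract does not spell out how the conditional masked language model objective is constructed, so the following is only a minimal sketch under stated assumptions: a word sequence and its phoneme-label sequence are concatenated into one input, one side is masked, and the other side stays visible as the conditioning context. The function names, the `[MASK]` and `[SEP]` symbols, and the masking probability are all hypothetical and are not taken from the paper.

```python
import random

MASK = "[MASK]"  # assumed mask symbol (not specified in the abstract)
SEP = "[SEP]"    # assumed separator between text and phoneme segments


def mask_tokens(tokens, mask_prob, rng):
    """Mask each token independently with probability mask_prob.

    Returns the masked sequence and a label list holding the original
    token at masked positions and None elsewhere.
    """
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


def conditional_mlm_example(words, phonemes, mask_side, mask_prob=0.15, seed=0):
    """Build one joint text+phoneme training example (hypothetical layout).

    mask_side="text" masks words while phonemes stay fully visible;
    mask_side="phoneme" does the reverse.
    """
    rng = random.Random(seed)
    if mask_side == "text":
        m_words, w_labels = mask_tokens(words, mask_prob, rng)
        m_phones, p_labels = list(phonemes), [None] * len(phonemes)
    elif mask_side == "phoneme":
        m_words, w_labels = list(words), [None] * len(words)
        m_phones, p_labels = mask_tokens(phonemes, mask_prob, rng)
    else:
        raise ValueError("mask_side must be 'text' or 'phoneme'")
    inputs = m_words + [SEP] + m_phones
    labels = w_labels + [None] + p_labels  # the separator is never predicted
    return inputs, labels
```

In this sketch a model would be trained to predict the non-`None` labels from the partially masked input, so that masked words must be recovered from their phonetic context or the other way around. Whether the paper uses this exact layout cannot be confirmed from the abstract alone.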
Related papers
- SPLAT: Speech-language Joint Pre-training For Spoken Language Understanding (2020) · 10.35
- Speech-language Pre-training For End-to-end Spoken Language Understanding (2021) · 9.41
- ST-BERT: Cross-modal Language Model Pre-training For End-to-end Spoken Language Understanding (2020) · 9.59
- Understanding Semantics From Speech Through Pre-training (2019) · 0.00
- Integrating Pretrained ASR And LM To Perform Sequence Generation For Spoken Language Understanding (2023) · 5.24
- Modality Confidence Aware Training For Robust End-to-end Spoken Language Understanding (2023) · 2.26
- Integrating Pre-trained Speech And Language Models For End-to-end Speech Recognition (2023) · 0.00
- A Study On The Integration Of Pre-trained SSL, ASR, LM And SLU Models For Spoken Language Understanding (2022) · 8.09