Modality Confidence Aware Training For Robust End-to-end Spoken Language Understanding
2023 · Suyoun Kim, Akshat Shrivastava, Duc Le, et al.
Abstract
End-to-end (E2E) spoken language understanding (SLU) systems that generate a semantic parse from speech have recently become more promising. This approach uses a single model that draws on audio and text representations from pre-trained automatic speech recognition (ASR) models, and it outperforms traditional pipeline SLU systems in on-device streaming scenarios. However, E2E SLU systems remain weak when the quality of the text representation is low because of ASR transcription errors. To overcome this issue, we propose a novel E2E SLU system that enhances robustness to ASR errors by fusing audio and text representations based on the estimated modality confidence of ASR hypotheses. We introduce two novel techniques: 1) an effective method to encode the quality of ASR hypotheses and 2) an effective approach to integrate them into E2E SLU models. We show accuracy improvements on the STOP dataset and share analysis that demonstrates the effectiveness of our approach.
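The abstract does not spell out how the fusion is computed. As a purely illustrative sketch of what "fusing audio and text representations based on the estimated modality confidence" could look like, the snippet below assumes a scalar confidence in [0, 1] for the ASR hypothesis and a simple convex combination of two same-length vectors. Both the function name `fuse_by_confidence` and the weighting scheme are assumptions made for illustration, not the paper's method.

```python
def fuse_by_confidence(audio_vec, text_vec, confidence):
    """Blend two same-length representation vectors.

    Hypothetical helper: `confidence` is an assumed scalar in [0, 1]
    expressing how much the ASR text is trusted. The paper's actual
    encoding and fusion mechanism may differ.
    """
    if len(audio_vec) != len(text_vec):
        raise ValueError("audio and text vectors must have the same length")
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    # High confidence leans on the text representation;
    # low confidence falls back to the audio representation.
    return [
        confidence * t + (1.0 - confidence) * a
        for a, t in zip(audio_vec, text_vec)
    ]


if __name__ == "__main__":
    audio = [1.0, 2.0]
    text = [3.0, 4.0]
    for c in (0.0, 0.5, 1.0):
        print(c, fuse_by_confidence(audio, text, c))
```

Under these assumptions, a confidence of 0 returns the audio vector unchanged and a confidence of 1 returns the text vector, so a poor ASR hypothesis contributes little to the fused representation.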
Related papers
- Multimodal Audio-textual Architecture For Robust Spoken Language Understanding (2023) 0.00
- Speech-language Pre-training For End-to-end Spoken Language Understanding (2021) 9.41
- Tie Your Embeddings Down: Cross-modal Latent Spaces For End-to-end Spoken Language Understanding (2020) 9.03
- End-to-end Spoken Language Understanding For Generalized Voice Assistants (2021) 6.34
- A Study On The Integration Of Pipeline And E2E SLU Systems For Spoken Semantic Parsing Toward STOP Quality Challenge (2023) 2.26
- Pre-training For Spoken Language Understanding With Joint Textual And Phonetic Representation Learning (2021) 2.26
- Integrating Pretrained ASR And LM To Perform Sequence Generation For Spoken Language Understanding (2023) 5.24
- End-to-end Spoken Language Understanding: Performance Analyses Of A Voice Command Task In A Low Resource Setting (2022) 8.35