Towards ASR Robust Spoken Language Understanding Through In-context Learning With Word Confusion Networks
2024 · Kevin Everson, Yile Gu, Huck Yang, et al.
Abstract
In the realm of spoken language understanding (SLU), numerous natural language understanding (NLU) methodologies have been adapted by supplying large language models (LLMs) with transcribed speech instead of conventional written text. In real-world scenarios, prior to input into an LLM, an automated speech recognition (ASR) system generates an output transcript hypothesis, where inherent errors can degrade subsequent SLU tasks. Here we introduce a method that utilizes the ASR system's lattice output instead of relying solely on the top hypothesis, aiming to encapsulate speech ambiguities and enhance SLU outcomes. Our in-context learning experiments, covering spoken question answering and intent classification, underline the LLM's resilience to noisy speech transcripts with the help of word confusion networks from lattices, bridging the SLU performance gap between using the top ASR hypothesis and an oracle upper bound. Additionally, we delve into the LLM's robustness to varying ASR perf
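The idea of passing a word confusion network (WCN) to an LLM rather than only the top ASR hypothesis can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the `/` separator between alternatives, the `<eps>` marker for an empty arc, the `min_prob` pruning threshold, and the prompt wording are all assumptions made for the example.

```python
from typing import List, Tuple

# Assumed representation: a WCN is a list of bins, one per word position.
# Each bin is a list of (word, posterior) alternatives from the ASR lattice.
WCN = List[List[Tuple[str, float]]]

EPS = "<eps>"  # hypothetical marker for "no word at this position"


def top_hypothesis(wcn: WCN) -> str:
    """Pick the highest-posterior word in every bin (the 1-best baseline)."""
    words = []
    for alternatives in wcn:
        if not alternatives:
            continue
        best_word, _ = max(alternatives, key=lambda wp: wp[1])
        if best_word != EPS:
            words.append(best_word)
    return " ".join(words)


def linearize_wcn(wcn: WCN, alt_sep: str = "/", min_prob: float = 0.1) -> str:
    """Flatten a WCN into one line of text, keeping competing words.

    Alternatives inside a bin are ordered by posterior and joined with
    alt_sep; alternatives below min_prob are pruned, but the best word
    of a bin is always kept.
    """
    pieces = []
    for alternatives in wcn:
        if not alternatives:
            continue
        ranked = sorted(alternatives, key=lambda wp: wp[1], reverse=True)
        kept = [w for w, p in ranked if p >= min_prob] or [ranked[0][0]]
        kept = [w for w in kept if w != EPS]  # drop empty arcs from the text
        if kept:
            pieces.append(alt_sep.join(kept))
    return " ".join(pieces)


def build_prompt(wcn: WCN, task_instruction: str, alt_sep: str = "/") -> str:
    """Assemble an illustrative prompt that explains the notation to the LLM."""
    return (
        f"{task_instruction}\n"
        f"The transcript comes from a speech recognizer. Words separated by "
        f"'{alt_sep}' are alternatives for the same position, most likely first.\n"
        f"Transcript: {linearize_wcn(wcn, alt_sep=alt_sep, min_prob=0.2)}\n"
        f"Answer:"
    )


# Toy network with invented posteriors, for illustration only.
example_wcn: WCN = [
    [("who", 0.9), ("how", 0.1)],
    [("wrote", 0.55), ("rode", 0.45)],
    [("the", 0.95), (EPS, 0.05)],
    [("odyssey", 0.6), ("odd", 0.3), ("sea", 0.1)],
]

one_best = top_hypothesis(example_wcn)
linearized = linearize_wcn(example_wcn, min_prob=0.2)
prompt = build_prompt(example_wcn, "Answer the spoken question.")
```

With this toy input, `one_best` carries a single word per position, while `linearized` also exposes the close competitors (`wrote/rode`, `odyssey/odd`), so the model can recover when the top word is wrong. The pruning threshold trades ambiguity coverage against prompt length.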
Related papers
- ML-LMCL: Mutual Learning And Large-margin Contrastive Learning For Improving ASR Robustness In Spoken Language Understanding (2023) 0.00
- Learning ASR-robust Contextualized Embeddings For Spoken Language Understanding (2019) 12.02
- Building Robust Spoken Language Understanding By Cross Attention Between Phoneme Sequence And ASR Hypothesis (2022) 2.26
- Multimodal Audio-textual Architecture For Robust Spoken Language Understanding (2023) 0.00
- Effectiveness Of Text, Acoustic, And Lattice-based Representations In Spoken Language Understanding Tasks (2022) 2.26
- Contrastive Learning For Improving ASR Robustness In Spoken Language Understanding (2022) 6.34
- Exploring The Integration Of Large Language Models Into Automatic Speech Recognition Systems: An Empirical Study (2023) 8.09
- Integrating Pretrained ASR And LM To Perform Sequence Generation For Spoken Language Understanding (2023) 5.24