Multi-task RNN-T With Semantic Decoder For Streamable Spoken Language Understanding
2022 · Xuandi Fu, Feng-Ju Chang, Martin Radfar, et al.
Abstract
End-to-end Spoken Language Understanding (E2E SLU) has attracted increasing interest due to its advantages of joint optimization and low latency compared to traditional cascaded pipelines. Existing E2E SLU models usually follow a two-stage configuration in which an Automatic Speech Recognition (ASR) network first predicts a transcript, which is then passed to a Natural Language Understanding (NLU) module through an interface to infer semantic labels such as intent and slot tags. This design, however, does not consider the NLU posterior while making transcript predictions, nor does it correct NLU prediction errors immediately by considering the previously predicted word-pieces. In addition, the NLU model in the two-stage system is not streamable, as it must wait for the audio segments to complete processing, which ultimately impacts the latency of the SLU system. In this work, we propose a streamable multi-task semantic transducer model to address these considerations. Our proposed archi
Related papers
- Speech-language Pre-training For End-to-end Spoken Language Understanding (2021) 9.41
- RNN Based Incremental Online Spoken Language Understanding (2019) 0.00
- Recent Advances In End-to-end Spoken Language Understanding (2019) 8.09
- Modality Confidence Aware Training For Robust End-to-end Spoken Language Understanding (2023) 2.26
- Integrating Pretrained ASR And LM To Perform Sequence Generation For Spoken Language Understanding (2023) 5.24
- End-to-end Spoken Language Understanding For Generalized Voice Assistants (2021) 6.34
- Chunked Attention-based Encoder-decoder Model For Streaming Speech Recognition (2023) 7.81
- Improving End-to-end Models For Set Prediction In Spoken Language Understanding (2022) 0.00