Improving Transducer-based Spoken Language Understanding With Self-conditioned CTC And Knowledge Transfer
2025 · Vishal Sunder, Eric Fosler-Lussier
Abstract
In this paper, we propose to improve end-to-end (E2E) spoken language understanding (SLU) in an RNN transducer (RNN-T) model by incorporating a joint self-conditioned CTC automatic speech recognition (ASR) objective. Our proposed model is akin to an E2E differentiable cascaded model that performs ASR and SLU sequentially. We ensure that the SLU task is conditioned on the ASR task through CTC self-conditioning. This novel joint modeling of ASR and SLU improves SLU performance significantly over using SLU optimization alone. We further improve performance by aligning the acoustic embeddings of this model with those of the semantically richer BERT model. Our proposed knowledge transfer strategy applies a bag-of-entity prediction layer to the aligned embeddings, and the output of this layer is used to condition the RNN-T based SLU decoding. These techniques show significant improvement over several strong baselines and can perform on par with large models like Whisper with significantly fewer p
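To make the self-conditioning idea concrete, the following is a minimal sketch of one common formulation of self-conditioned CTC: an intermediate encoder layer produces per-frame CTC posteriors, which are projected back to the hidden size and added to the hidden states, so that later layers are conditioned on the intermediate ASR prediction. This is an illustration under stated assumptions, not the authors' implementation. The function names, the matrices `w_ctc` and `w_back`, and the sizes `T`, `D`, and `V` are all hypothetical, and the weights are random rather than trained.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over one frame's logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(x, w):
    """Multiply a (T x D_in) list of rows by a (D_in x D_out) matrix."""
    cols = list(zip(*w))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in x]


def self_condition(hidden, w_ctc, w_back):
    """Apply one self-conditioning step to per-frame hidden states.

    hidden: T x D encoder states at an intermediate layer
    w_ctc:  D x V projection onto the CTC vocabulary (blank included)
    w_back: V x D projection from posteriors back to the hidden size
    Returns (conditioned hidden states, per-frame CTC posteriors).
    """
    posteriors = [softmax(row) for row in matmul(hidden, w_ctc)]
    feedback = matmul(posteriors, w_back)
    conditioned = [
        [h + f for h, f in zip(h_row, f_row)]
        for h_row, f_row in zip(hidden, feedback)
    ]
    return conditioned, posteriors


def random_matrix(rows, cols, rng):
    """Small random matrix standing in for trained weights (assumption)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
T, D, V = 4, 8, 5  # hypothetical: frames, hidden size, CTC vocabulary size
hidden = random_matrix(T, D, rng)
w_ctc = random_matrix(D, V, rng)
w_back = random_matrix(V, D, rng)
conditioned, posteriors = self_condition(hidden, w_ctc, w_back)
```

In a trained model, the intermediate posteriors would also receive a CTC loss against the transcript, so the feedback term carries ASR information into the layers that feed the SLU decoder.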
Related papers
- End-to-end Spoken Language Understanding Using Transformer Networks And Self-supervised Pre-trained Features (2020)
- Linguistic-enhanced Transformer With CTC Embedding For Speech Recognition (2022)
- Improving Hybrid CTC/Attention End-to-end Speech Recognition With Pretrained Acoustic And Language Model (2021)
- Non-autoregressive End-to-end Approaches For Joint Automatic Speech Recognition And Spoken Language Understanding (2023)
- Speech-language Pre-training For End-to-end Spoken Language Understanding (2021)
- Pre-training For Spoken Language Understanding With Joint Textual And Phonetic Representation Learning (2021)
- Recent Advances In End-to-end Spoken Language Understanding (2019)
- Large-scale Transfer Learning For Low-resource Spoken Language Understanding (2020)