Non-autoregressive End-to-end Approaches For Joint Automatic Speech Recognition And Spoken Language Understanding
2023 · Mohan Li, Rama Doddipatla
Abstract
This paper presents non-autoregressive (NAR) approaches for joint automatic speech recognition (ASR) and spoken language understanding (SLU). The proposed NAR systems employ a Conformer encoder that applies connectionist temporal classification (CTC) to transcribe the speech utterance into raw ASR hypotheses, which are then refined by a decoder modeled on bidirectional encoder representations from Transformers (BERT). The same decoder simultaneously predicts the intent and slot labels of the utterance. Both Mask-CTC and self-conditioned CTC (SC-CTC) approaches are explored in this study. Experiments conducted on the SLURP dataset show that the proposed SC-Mask-CTC NAR system achieves 3.7% and 3.2% absolute gains in SLU metrics and a competitive level of ASR accuracy when compared to a Conformer-Transformer based autoregressive (AR) model. Additionally, the NAR systems decode 6x faster than the AR baseline.
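The first stage of the pipeline described in the abstract (CTC transcription, followed by marking uncertain tokens for refinement) can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the blank index `0`, the mask token `"<mask>"`, the `0.9` confidence threshold, the choice of the maximum frame probability as a token's confidence, and both function names are hypothetical. The BERT-like decoder that fills the masked positions and predicts intent and slot labels is omitted.

```python
from typing import List, Sequence, Tuple, Union

BLANK = 0        # assumed index of the CTC blank symbol
MASK = "<mask>"  # hypothetical mask token handed to the refinement decoder


def ctc_greedy_decode(
    frame_probs: Sequence[Sequence[float]],
) -> List[Tuple[int, float]]:
    """Frame-wise argmax, then merge repeats and drop blanks.

    Returns (token_id, confidence) pairs, where confidence is the highest
    frame probability among the merged frames (an assumption of this sketch).
    """
    tokens: List[Tuple[int, float]] = []
    prev = None
    for probs in frame_probs:
        best = max(range(len(probs)), key=probs.__getitem__)
        p = probs[best]
        if best == prev:
            # Repeated symbol: merge into the previous token, keep the peak.
            if best != BLANK:
                tid, conf = tokens[-1]
                tokens[-1] = (tid, max(conf, p))
        elif best != BLANK:
            tokens.append((best, p))
        prev = best
    return tokens


def mask_low_confidence(
    tokens: Sequence[Tuple[int, float]], threshold: float = 0.9
) -> List[Union[int, str]]:
    """Replace tokens whose confidence is below the threshold with MASK."""
    return [tid if conf >= threshold else MASK for tid, conf in tokens]


# Toy posteriors over a 3-symbol vocabulary (index 0 is blank), 5 frames.
frame_probs = [
    [0.10, 0.80, 0.10],
    [0.00, 0.95, 0.05],
    [0.90, 0.05, 0.05],
    [0.30, 0.20, 0.50],
    [0.20, 0.70, 0.10],
]
hypothesis = ctc_greedy_decode(frame_probs)
masked = mask_low_confidence(hypothesis)
```

In a Mask-CTC style system, the masked sequence would then be passed to the decoder, which predicts the masked positions in parallel rather than token by token; that parallelism is the source of the decoding speed-up over an AR model.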
Related papers
- A Comparative Study On Non-autoregressive Modelings For Speech-to-text Generation (2021)
- Improving Non-autoregressive End-to-end Speech Recognition With Pre-trained Acoustic And Language Models (2022)
- Improved Mask-CTC For Non-autoregressive End-to-end ASR (2020)
- Non-autoregressive Transformer With Unified Bidirectional Decoder For Automatic Speech Recognition (2021)
- A CTC Alignment-based Non-autoregressive Transformer For End-to-end Automatic Speech Recognition (2023)
- Improving Transducer-based Spoken Language Understanding With Self-conditioned CTC And Knowledge Transfer (2025)
- 4D ASR: Joint Modeling Of CTC, Attention, Transducer, And Mask-predict Decoders (2022)
- Non-autoregressive Transformer ASR With CTC-enhanced Decoder Input (2020)