Improved Mask-ctc For Non-autoregressive End-to-end ASR
2020 Β· Yosuke Higuchi, Hirofumi Inaguma, Shinji Watanabe, et al.
Abstract
For real-world deployment of automatic speech recognition (ASR), the system is desired to be capable of fast inference while relieving the requirement of computational resources. The recently proposed end-to-end ASR system based on mask-predict with connectionist temporal classification (CTC), Mask-CTC, fulfills this demand by generating tokens in a non-autoregressive fashion. While Mask-CTC achieves remarkably fast inference speed, its recognition performance falls behind that of conventional autoregressive (AR) systems. To boost the performance of Mask-CTC, we first propose to enhance the encoder network architecture by employing a recently proposed architecture called Conformer. Next, we propose new training and decoding methods by introducing auxiliary objective to predict the length of a partial target sequence, which allows the model to delete or insert tokens during inference. Experimental results on different ASR tasks show that the proposed approaches improve Mask-CTC signific
Authors
(none)
Tags
Stats
Related papers
- An Investigation Of Enhancing CTC Model For Triggered Attention-based Streaming ASR (2021)0.00
- Mask-ctc-based Encoder Pre-training For Streaming End-to-end Speech Recognition (2023)0.00
- BERT Meets CTC: New Formulation Of End-to-end Speech Recognition With Pre-trained Masked Language Model (2022)0.00
- Non-autoregressive Transformer ASR With Ctc-enhanced Decoder Input (2020)10.97
- A CTC Alignment-based Non-autoregressive Transformer For End-to-end Automatic Speech Recognition (2023)10.97
- 4D ASR: Joint Modeling Of CTC, Attention, Transducer, And Mask-predict Decoders (2022)7.50
- Non-autoregressive End-to-end Approaches For Joint Automatic Speech Recognition And Spoken Language Understanding (2023)5.84
- Non-autoregressive Error Correction For Ctc-based ASR With Phone-conditioned Masked LM (2022)5.84