4D ASR: Joint Modeling Of CTC, Attention, Transducer, And Mask-predict Decoders
2022 · Yui Sudo, Muhammad Shakeel, Brian Yan, et al.
Abstract
The network architecture of end-to-end (E2E) automatic speech recognition (ASR) can be classified into several models, including connectionist temporal classification (CTC), recurrent neural network transducer (RNN-T), attention mechanism, and non-autoregressive mask-predict models. Since each of these network architectures has pros and cons, a typical use case is to switch these separate models depending on the application requirement, resulting in the increased overhead of maintaining all models. Several methods for integrating two of these complementary models to mitigate the overhead issue have been proposed; however, if we integrate more models, we will further benefit from these complementary models and realize broader applications with a single system. This paper proposes four-decoder joint modeling (4D) of CTC, attention, RNN-T, and mask-predict, which has the following three advantages: 1) The four decoders are jointly trained so that they can be easily switched depending on t
Related papers
- Joint Beam Search Integrating CTC, Attention, And Transducer Decoders (2024)
- Improved Mask-CTC For Non-autoregressive End-to-end ASR (2020)
- Multi-stream End-to-end Speech Recognition (2019)
- Multi-encoder Multi-resolution Framework For End-to-end Speech Recognition (2018)
- Decoupling And Interacting Multi-task Learning Network For Joint Speech And Accent Recognition (2023)
- Advances In Joint CTC-attention Based End-to-end Speech Recognition With A Deep CNN Encoder And RNN-LM (2017)
- Non-autoregressive End-to-end Approaches For Joint Automatic Speech Recognition And Spoken Language Understanding (2023)
- 3M: Multi-loss, Multi-path And Multi-level Neural Networks For Speech Recognition (2022)