Joint Beam Search Integrating CTC, Attention, And Transducer Decoders
2024 · Yui Sudo, Muhammad Shakeel, Yosuke Fukumoto, et al.
Abstract
End-to-end automatic speech recognition (E2E-ASR) can be classified by decoder architecture, such as connectionist temporal classification (CTC), recurrent neural network transducer (RNN-T), attention-based encoder-decoder, and Mask-CTC models. Each decoder architecture has advantages and disadvantages, leading practitioners to switch between these models depending on application requirements. Instead of building separate models, we propose a joint modeling scheme in which four decoders (CTC, RNN-T, attention, and Mask-CTC) share the same encoder; we refer to this as 4D modeling. The 4D model is trained jointly, which regularizes the model and improves its robustness thanks to the complementary properties of the decoders. To train the 4D model efficiently, we introduce a two-stage training strategy that stabilizes the joint training. In addition, we propose three novel joint beam search algorithms by combining three decoders (CTC, RNN-T, and attention) to further improv
Related papers
- 4D ASR: Joint Modeling Of CTC, Attention, Transducer, And Mask-predict Decoders (2022) 7.50
- Multi-stream End-to-end Speech Recognition (2019) 8.35
- Joint CTC-attention Based End-to-end Speech Recognition Using Multi-task Learning (2016) 20.43
- Advances In Joint CTC-attention Based End-to-end Speech Recognition With A Deep CNN Encoder And RNN-LM (2017) 16.49
- A Fully Differentiable Beam Search Decoder (2019) 0.00
- Multi-encoder Multi-resolution Framework For End-to-end Speech Recognition (2018) 0.00
- Integration Of Frame- And Label-synchronous Beam Search For Streaming Encoder-decoder Speech Recognition (2023) 0.00
- Linguistic-enhanced Transformer With CTC Embedding For Speech Recognition (2022) 2.26