Advances In Joint CTC-attention Based End-to-end Speech Recognition With A Deep CNN Encoder And RNN-LM
2017 · Takaaki Hori, Shinji Watanabe, Yu Zhang, et al.
Abstract
We present a state-of-the-art end-to-end Automatic Speech Recognition (ASR) model. We learn to listen and write characters with a joint Connectionist Temporal Classification (CTC) and attention-based encoder-decoder network. The encoder is a deep Convolutional Neural Network (CNN) based on the VGG network. The CTC network sits on top of the encoder and is jointly trained with the attention-based decoder. During the beam search process, we combine the CTC predictions, the attention-based decoder predictions and a separately trained LSTM language model. We achieve a 5-10% error reduction compared to prior systems on spontaneous Japanese and Chinese speech, and our end-to-end model beats out traditional hybrid ASR systems.
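The joint training and decoding described in the abstract can be sketched as two weighted combinations. This is a minimal illustration, not the authors' implementation: the abstract does not state how the CTC, attention, and language model scores are combined, so the log-domain linear interpolation, the function names, and the weight values `ctc_weight` and `lm_weight` below are all assumptions chosen for the example.

```python
import math


def joint_loss(ctc_loss, att_loss, ctc_weight=0.5):
    """Assumed multi-task objective: interpolate the CTC and attention losses."""
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * att_loss


def joint_score(log_p_ctc, log_p_att, log_p_lm, ctc_weight=0.3, lm_weight=0.3):
    """Assumed beam-search score: weighted sum of the three log-probabilities."""
    return (
        ctc_weight * log_p_ctc
        + (1.0 - ctc_weight) * log_p_att
        + lm_weight * log_p_lm
    )


def rank_hypotheses(hyps, ctc_weight=0.3, lm_weight=0.3):
    """Rank (text, log_p_ctc, log_p_att, log_p_lm) tuples, best first."""
    scored = [
        (joint_score(c, a, l, ctc_weight, lm_weight), text)
        for text, c, a, l in hyps
    ]
    return sorted(scored, reverse=True)


# Hypothetical scores: the attention decoder alone prefers "a",
# but CTC and the language model both favor "b".
hyps = [
    ("a", -4.0, -1.0, -6.0),
    ("b", -2.0, -2.0, -2.0),
]
ranking = rank_hypotheses(hyps)
```

With these made-up numbers, the combined score ranks "b" first even though the attention decoder alone would pick "a", which is the kind of correction the CTC and language model terms are meant to provide during beam search.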
Related papers
- An Improved Hybrid CTC-attention Model For Speech Recognition (2018)
- Joint CTC-attention Based End-to-end Speech Recognition Using Multi-task Learning (2016)
- Residual Convolutional CTC Networks For Automatic Speech Recognition (2017)
- Linguistic-enhanced Transformer With CTC Embedding For Speech Recognition (2022)
- Multi-stream End-to-end Speech Recognition (2019)
- Multi-encoder Multi-resolution Framework For End-to-end Speech Recognition (2018)
- 4D ASR: Joint Modeling Of CTC, Attention, Transducer, And Mask-predict Decoders (2022)
- Advances In All-neural Speech Recognition (2016)