Joint CTC-attention Based End-to-end Speech Recognition Using Multi-task Learning
2016 · Suyoun Kim, Takaaki Hori, Shinji Watanabe
Abstract
Recently, there has been increasing interest in end-to-end speech recognition that directly transcribes speech to text without any predefined alignments. One approach is the attention-based encoder-decoder framework, which learns a mapping between variable-length input and output sequences in one step using a purely data-driven method. The attention model has often been shown to outperform another end-to-end approach, Connectionist Temporal Classification (CTC), mainly because it explicitly uses the history of the target characters without any conditional independence assumptions. However, we observed that the attention model performs poorly in noisy conditions and is hard to train in the initial training stage with long input sequences. This is because the attention model is too flexible to predict proper alignments in such cases, as it lacks the left-to-right constraints used in CTC. This paper presents a novel method for end-to-end speec
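As a rough illustration of the two ideas the abstract contrasts, the sketch below shows (a) the left-to-right collapsing rule that CTC applies to frame-level paths and (b) one common way to combine a CTC loss and an attention loss in multi-task training, as a weighted sum. The abstract is truncated here before it describes the method, so the weighted-sum form, the weight `lam`, the blank symbol `_`, and the function names `ctc_collapse` and `joint_loss` are all assumptions made for illustration, not the authors' implementation.

```python
from itertools import groupby

BLANK = "_"  # hypothetical blank symbol, chosen only for this sketch


def ctc_collapse(path, blank=BLANK):
    """Map a frame-level CTC path to its label sequence.

    Consecutive repeats are merged first, then blanks are dropped.
    Because frames are read strictly in order, the resulting alignment
    is monotonic (left-to-right).
    """
    merged = [symbol for symbol, _ in groupby(path)]
    return "".join(symbol for symbol in merged if symbol != blank)


def joint_loss(ctc_loss, attention_loss, lam=0.5):
    """Interpolate two scalar losses (assumed multi-task form).

    lam = 1.0 uses only the CTC loss, lam = 0.0 only the attention loss.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * ctc_loss + (1.0 - lam) * attention_loss
```

For example, `ctc_collapse("hh_e_ll_lo")` returns `"hello"`: the blank between the two `l` runs is what keeps the double letter from being merged.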
Related papers
- Multi-encoder Multi-resolution Framework For End-to-end Speech Recognition (2018) · 0.00
- Advances In Joint CTC-attention Based End-to-end Speech Recognition With A Deep CNN Encoder And RNN-LM (2017) · 16.49
- Multi-stream End-to-end Speech Recognition (2019) · 8.35
- An Improved Hybrid CTC-attention Model For Speech Recognition (2018) · 0.00
- Multitask Learning With CTC And Segmental CRF For Speech Recognition (2017) · 0.00
- E2E-based Multi-task Learning Approach To Joint Speech And Accent Recognition (2021) · 0.00
- Joint Beam Search Integrating CTC, Attention, And Transducer Decoders (2024) · 5.24
- Self-attention Networks For Connectionist Temporal Classification In Speech Recognition (2019) · 14.55