Hierarchical Conditional End-to-end ASR With CTC And Multi-granular Subword Units
2021 · Yosuke Higuchi, Keita Karube, Tetsuji Ogawa, et al.
Abstract
In end-to-end automatic speech recognition (ASR), a model is expected to implicitly learn representations suitable for recognizing a word-level sequence. However, the huge abstraction gap between input acoustic signals and output linguistic tokens makes it challenging for a model to learn the representations. In this work, to promote the word-level representation learning in end-to-end ASR, we propose a hierarchical conditional model that is based on connectionist temporal classification (CTC). Our model is trained by auxiliary CTC losses applied to intermediate layers, where the vocabulary size of each target subword sequence is gradually increased as the layer becomes close to the word-level output. Here, we make each level of sequence prediction explicitly conditioned on the previous sequences predicted at lower levels. With the proposed approach, we expect the proposed model to learn the word-level representations effectively by exploiting a hierarchy of linguistic structures. Expe
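The training objective described in the abstract, with one CTC loss per supervised layer and a subword vocabulary that grows toward the output, can be illustrated with a minimal pure-Python sketch. Everything named here is hypothetical and not taken from the paper: the helpers `ctc_nll` and `hierarchical_ctc_loss`, the toy vocabulary sizes, the uniform posteriors, and the equal weighting of the per-level losses. The conditioning of each level on lower-level predictions is not shown; the sketch covers only how the per-level losses could be computed and combined.

```python
import math

NEG_INF = float("-inf")

def logsumexp(*xs):
    # Numerically stable log(sum(exp(x))) that tolerates -inf entries.
    m = max(xs)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(x - m) for x in xs))

def ctc_nll(log_probs, target, blank=0):
    """CTC negative log-likelihood via the standard forward algorithm.

    log_probs: list of T frames, each a list of V log-probabilities.
    target:    list of token ids without blanks.
    """
    # Extended label sequence with blanks interleaved: [b, y1, b, y2, ..., b].
    ext = [blank]
    for tok in target:
        ext += [tok, blank]
    S = len(ext)

    alpha = [NEG_INF] * S
    alpha[0] = log_probs[0][blank]
    if S > 1:
        alpha[1] = log_probs[0][ext[1]]

    for t in range(1, len(log_probs)):
        prev, alpha = alpha, [NEG_INF] * S
        for s in range(S):
            terms = [prev[s]]
            if s >= 1:
                terms.append(prev[s - 1])
            # Skipping over a blank is allowed only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                terms.append(prev[s - 2])
            alpha[s] = logsumexp(*terms) + log_probs[t][ext[s]]

    # Valid paths end on the final label or on the trailing blank.
    total = logsumexp(alpha[-1], alpha[-2]) if S > 1 else alpha[-1]
    return -total

def hierarchical_ctc_loss(level_log_probs, level_targets, blank=0):
    # Assumption: per-level losses are averaged with equal weight.
    losses = [ctc_nll(lp, tgt, blank)
              for lp, tgt in zip(level_log_probs, level_targets)]
    return sum(losses) / len(losses), losses

def uniform(T, V):
    # Toy stand-in for a layer's softmax output: uniform over V tokens.
    return [[math.log(1.0 / V)] * V for _ in range(T)]

# Two supervised levels over the same 6 frames. The lower level uses a small
# vocabulary (longer target); the higher level uses a larger one (shorter target).
levels = [uniform(6, 4), uniform(6, 9)]
targets = [[1, 2, 3, 1], [5, 7]]
total, per_level = hierarchical_ctc_loss(levels, targets)
print(per_level, total)
```

In a real model the per-frame log-probabilities would come from softmax layers attached to intermediate and final encoder layers, and a framework's built-in CTC loss would normally replace the hand-written forward pass.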
Related papers
- Multitask Learning With CTC And Segmental CRF For Speech Recognition (2017) · 0.00
- Hierarchical Multitask Learning For CTC-based Speech Recognition (2018) · 0.00
- Improving Transducer-based Spoken Language Understanding With Self-conditioned CTC And Knowledge Transfer (2025) · 0.00
- Residual Convolutional CTC Networks For Automatic Speech Recognition (2017) · 0.00
- A CTC Alignment-based Non-autoregressive Transformer For End-to-end Automatic Speech Recognition (2023) · 10.97
- Relaxing The Conditional Independence Assumption Of CTC-based ASR By Conditioning On Intermediate Predictions (2021) · 13.34
- Improved Mask-CTC For Non-autoregressive End-to-end ASR (2020) · 11.76
- Alternate Intermediate Conditioning With Syllable-level And Character-level Targets For Japanese ASR (2022) · 0.00