Multitask Learning With CTC And Segmental CRF For Speech Recognition
2017 · Liang Lu, Lingpeng Kong, Chris Dyer, et al.
Abstract
Segmental conditional random fields (SCRFs) and connectionist temporal classification (CTC) are two sequence labeling methods used for end-to-end training of speech recognition models. Both models define a transcription probability by marginalizing decisions about latent segmentation alternatives to derive a sequence probability: the former uses a globally normalized joint model of segment labels and durations, and the latter classifies each frame as either an output symbol or a "continuation" of the previous label. In this paper, we train a recognition model by optimizing an interpolation between the SCRF and CTC losses, where the same recurrent neural network (RNN) encoder is used for feature extraction for both outputs. We find that this multitask objective improves recognition accuracy when decoding with either the SCRF or CTC models. Additionally, we show that CTC can also be used to pretrain the RNN encoder, which improves the convergence rate when learning the joint model.
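As a minimal sketch of the two ideas in the abstract, the Python below computes a CTC negative log-likelihood by summing over all frame-level alignments, then interpolates it with an SCRF loss. It assumes the standard CTC formulation with a blank symbol, and the function names, the weight `lam`, and the choice of which term `lam` multiplies are illustrative assumptions rather than details taken from the abstract. The SCRF loss is passed in as a precomputed number because its segment-level scoring is not described here.

```python
import math


def ctc_neg_log_likelihood(probs, labels, blank=0):
    """Negative log-probability of `labels` under per-frame distributions.

    probs:  list of T per-frame probability lists (each sums to 1).
    labels: target label indices, without blanks.
    Sums over every frame-level alignment with the CTC forward recursion.
    """
    # Interleave blanks: [blank, l1, blank, l2, ..., blank]
    ext = [blank]
    for label in labels:
        ext += [label, blank]
    num_states = len(ext)

    # Initialisation: a path may start in the first blank or the first label.
    alpha = [0.0] * num_states
    alpha[0] = probs[0][ext[0]]
    if num_states > 1:
        alpha[1] = probs[0][ext[1]]

    for t in range(1, len(probs)):
        new_alpha = [0.0] * num_states
        for s in range(num_states):
            total = alpha[s]                      # stay in the same state
            if s >= 1:
                total += alpha[s - 1]             # advance by one state
            # Skip over a blank only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new_alpha[s] = total * probs[t][ext[s]]
        alpha = new_alpha

    # A valid path ends in the last label or the trailing blank.
    prob = alpha[-1] + (alpha[-2] if num_states > 1 else 0.0)
    return -math.log(prob)


def multitask_loss(scrf_loss, ctc_loss, lam=0.5):
    """Interpolate the two losses; `lam` weights the SCRF term (assumed)."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * scrf_loss + (1.0 - lam) * ctc_loss


# Two frames, vocabulary {blank, "a"}, uniform frame distributions.
frame_probs = [[0.5, 0.5], [0.5, 0.5]]
ctc = ctc_neg_log_likelihood(frame_probs, labels=[1])
total = multitask_loss(scrf_loss=1.0, ctc_loss=ctc, lam=0.5)  # SCRF value is a placeholder
```

In the toy case, three alignments (a a, a blank, blank a) map to the single label, so the marginal probability is 3 × 0.25 = 0.75. In a real system both losses would be computed from the outputs of the same RNN encoder, so the gradient of the interpolated loss updates the shared parameters.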
Related papers
- Segmental Recurrent Neural Networks For End-to-end Speech Recognition (2016)
- Training LDCRF Model On Unsegmented Sequences Using Connectionist Temporal Classification (2016)
- Joint CTC-attention Based End-to-end Speech Recognition Using Multi-task Learning (2016)
- Hierarchical Multitask Learning For CTC-based Speech Recognition (2018)
- Hierarchical Conditional End-to-end ASR With CTC And Multi-granular Subword Units (2021)
- CTC-segmentation Of Large Corpora For German End-to-end Speech Recognition (2020)
- Residual Convolutional CTC Networks For Automatic Speech Recognition (2017)
- CR-CTC: Consistency Regularization On CTC For Improved Speech Recognition (2024)