BERT Meets CTC: New Formulation Of End-to-end Speech Recognition With Pre-trained Masked Language Model
2022 · Yosuke Higuchi, Brian Yan, Siddhant Arora, et al.
Abstract
This paper presents BERT-CTC, a novel formulation of end-to-end speech recognition that adapts BERT for connectionist temporal classification (CTC). Our formulation relaxes the conditional independence assumptions used in conventional CTC and incorporates linguistic knowledge through the explicit output dependency obtained by BERT contextual embedding. BERT-CTC attends to the full contexts of the input and hypothesized output sequences via the self-attention mechanism. This mechanism encourages a model to learn inner/inter-dependencies between the audio and token representations while maintaining CTC's training efficiency. During inference, BERT-CTC combines a mask-predict algorithm with CTC decoding, which iteratively refines an output sequence. The experimental results reveal that BERT-CTC improves over conventional approaches across variations in speaking styles and languages. Finally, we show that the semantic representations in BERT-CTC are beneficial towards downstream spoken lan
Related papers
- Improved Mask-CTC For Non-autoregressive End-to-end ASR (2020) · 11.76
- CR-CTC: Consistency Regularization On CTC For Improved Speech Recognition (2024) · 6.30
- Advances In All-neural Speech Recognition (2016) · 11.29
- Joint CTC-attention Based End-to-end Speech Recognition Using Multi-task Learning (2016) · 20.43
- Self-attention Networks For Connectionist Temporal Classification In Speech Recognition (2019) · 14.55
- An Improved Hybrid CTC-attention Model For Speech Recognition (2018) · 0.00
- Improving Non-autoregressive End-to-end Speech Recognition With Pre-trained Acoustic And Language Models (2022) · 10.07
- Disentangling Speakers In Multi-talker Speech Recognition With Speaker-aware CTC (2024) · 4.98