Improving Hybrid CTC/Attention End-to-end Speech Recognition With Pretrained Acoustic And Language Model
2021 · Keqi Deng, Songjun Cao, Yike Zhang, et al.
Abstract
Recently, self-supervised pretraining has achieved impressive results in end-to-end (E2E) automatic speech recognition (ASR). However, the dominant sequence-to-sequence (S2S) E2E model still struggles to fully exploit self-supervised pretraining, because its decoder is conditioned on the acoustic representation and thus cannot be pretrained separately. In this paper, we propose a pretrained Transformer (Preformer) S2S ASR architecture based on hybrid CTC/attention E2E models that makes full use of pretrained acoustic models (AMs) and language models (LMs). In our framework, the encoder is initialized with a pretrained AM (wav2vec2.0). The Preformer leverages CTC as an auxiliary task during both training and inference. Furthermore, we design a one-cross decoder (OCD), which relaxes the dependence on acoustic representations so that it can be initialized with a pretrained LM (DistilGPT2). Experiments conducted on the AISHELL-1 corpus achieve a 4.6% character error rate (CER) on the
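The use of CTC as an auxiliary task in both training and inference can be illustrated with the usual hybrid CTC/attention interpolation. The following is a minimal sketch, not the authors' implementation: the function names, the example hypotheses, and the default weight of 0.3 are assumptions chosen for illustration and are not taken from the paper.

```python
def hybrid_loss(ctc_loss: float, att_loss: float, ctc_weight: float = 0.3) -> float:
    """Interpolate the CTC loss and the attention-decoder loss for multi-task training.

    ctc_weight = 0.3 is an illustrative value, not one reported in the abstract.
    """
    if not 0.0 <= ctc_weight <= 1.0:
        raise ValueError("ctc_weight must lie in [0, 1]")
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * att_loss


def joint_score(ctc_logp: float, att_logp: float, ctc_weight: float = 0.3) -> float:
    """Combine the two log-probabilities of one hypothesis at decoding time."""
    return ctc_weight * ctc_logp + (1.0 - ctc_weight) * att_logp


def pick_best(hyps, ctc_weight: float = 0.3) -> str:
    """Return the text of the hypothesis with the highest joint score.

    hyps is a list of (text, ctc_log_prob, attention_log_prob) tuples.
    """
    return max(hyps, key=lambda h: joint_score(h[1], h[2], ctc_weight))[0]


if __name__ == "__main__":
    # Hypothetical hypotheses with made-up log-probabilities.
    hyps = [("hyp_a", -1.0, -5.0), ("hyp_b", -3.0, -2.0)]
    print(hybrid_loss(2.0, 1.0, ctc_weight=0.5))
    print(pick_best(hyps))
```

Setting `ctc_weight` to 0 reduces both functions to the attention decoder alone, and setting it to 1 reduces them to CTC alone, so the weight controls how strongly the auxiliary task influences training and hypothesis selection.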
Related papers
- Improving Non-autoregressive End-to-end Speech Recognition With Pre-trained Acoustic And Language Models (2022) 10.07
- An Improved Hybrid CTC-Attention Model For Speech Recognition (2018) 0.00
- Transformer-based Online CTC/Attention End-to-end Speech Recognition Architecture (2020) 14.06
- Linguistic-enhanced Transformer With CTC Embedding For Speech Recognition (2022) 2.26
- Improving Transducer-based Spoken Language Understanding With Self-conditioned CTC And Knowledge Transfer (2025) 0.00
- Integrating Pre-trained Speech And Language Models For End-to-end Speech Recognition (2023) 0.00
- Improving Transformer-based Conversational ASR By Inter-sentential Attention Mechanism (2022) 7.50
- Advances In Joint CTC-Attention Based End-to-end Speech Recognition With A Deep CNN Encoder And RNN-LM (2017) 16.49