Improving Non-autoregressive End-to-end Speech Recognition With Pre-trained Acoustic And Language Models
2022 · Keqi Deng, Zehui Yang, Shinji Watanabe, et al.
Abstract
While Transformers have achieved promising results in end-to-end (E2E) automatic speech recognition (ASR), their autoregressive (AR) structure is a bottleneck for speeding up decoding. For real-world deployment, ASR systems need to be highly accurate while also running inference quickly. Non-autoregressive (NAR) models have become a popular alternative because of their fast inference, but they still fall behind AR systems in recognition accuracy. To meet both demands, in this paper we propose a NAR CTC/attention model that uses both a pre-trained acoustic model and a pre-trained language model: wav2vec2.0 and BERT. To bridge the modality gap between the speech and text representations obtained from the pre-trained models, we design a novel modality conversion mechanism, which is more suitable for logographic languages. During inference, we employ a CTC branch to generate a target length, which enables BERT to predict tokens in parallel. We also design a cache-based CTC/attention
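The length-prediction step mentioned in the abstract can be illustrated with a minimal sketch. It assumes standard greedy CTC decoding (collapse repeated frame labels, then drop blanks) and takes the length of the collapsed sequence as the number of tokens to predict in parallel. The function names, the blank index, and the toy frame labels are hypothetical and are not taken from the paper, whose actual procedure may differ.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank symbol (hypothetical choice)


def ctc_greedy_collapse(frame_ids, blank=BLANK):
    """Merge consecutive repeated labels, then remove blanks."""
    return [label for label, _ in groupby(frame_ids) if label != blank]


def predict_target_length(frame_ids, blank=BLANK):
    """Length of the collapsed CTC output, used as the parallel decoding length."""
    return len(ctc_greedy_collapse(frame_ids, blank))


# Toy per-frame argmax labels from a CTC branch (made-up values).
frames = [0, 7, 7, 0, 0, 3, 3, 3, 0, 7, 0]
tokens = ctc_greedy_collapse(frames)
length = predict_target_length(frames)
print(tokens, length)
```

Once the length is fixed, a decoder can fill that many positions at once instead of emitting one token per step, which is where the speed-up over AR decoding comes from.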
Related papers
- Improving Hybrid CTC/attention End-to-end Speech Recognition With Pretrained Acoustic And Language Model (2021) 8.82
- Non-autoregressive End-to-end Approaches For Joint Automatic Speech Recognition And Spoken Language Understanding (2023) 5.84
- A CTC Alignment-based Non-autoregressive Transformer For End-to-end Automatic Speech Recognition (2023) 10.97
- TSNAT: Two-step Non-autoregressvie Transformer Models For Speech Recognition (2021) 10.61
- Non-autoregressive Transformer ASR With CTC-enhanced Decoder Input (2020) 10.97
- Linguistic-enhanced Transformer With CTC Embedding For Speech Recognition (2022) 2.26
- Integrating Pre-trained Speech And Language Models For End-to-end Speech Recognition (2023) 0.00
- Transformer-based Online CTC/attention End-to-end Speech Recognition Architecture (2020) 14.06