A CTC Alignment-based Non-autoregressive Transformer For End-to-end Automatic Speech Recognition
2023 · Ruchao Fan, Wei Chu, Peng Chang, et al.
Abstract
Recently, end-to-end models have been widely used in automatic speech recognition (ASR) systems. Two of the most representative approaches are connectionist temporal classification (CTC) and attention-based encoder-decoder (AED) models. Autoregressive transformers, variants of AED, adopt an autoregressive mechanism for token generation and thus are relatively slow during inference. In this paper, we present a comprehensive study of a CTC Alignment-based Single-Step Non-Autoregressive Transformer (CASS-NAT) for end-to-end ASR. In CASS-NAT, word embeddings in the autoregressive transformer (AT) are substituted with token-level acoustic embeddings (TAE) that are extracted from encoder outputs with the acoustical boundary information offered by the CTC alignment. TAE can be obtained in parallel, resulting in a parallel generation of output tokens. During training, Viterbi-alignment is used for TAE generation, and multiple training strategies are further explored to improve the word error r
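The TAE idea described in the abstract can be illustrated with a minimal sketch. It assumes a frame-level CTC alignment is already available (for example from Viterbi alignment), collapses it into per-token frame spans, and then builds one embedding per token from the encoder frames in its span. Mean pooling is used here only as a simple stand-in; the abstract does not specify how the extraction is done, and the function names, the blank index of 0, and the toy data are all hypothetical.

```python
from itertools import groupby

BLANK = 0  # assumption: index 0 is the CTC blank symbol


def alignment_to_segments(alignment, blank=BLANK):
    """Collapse a frame-level CTC alignment into (token, start, end) spans.

    `end` is exclusive. Consecutive repeats of a token form one span,
    and blank frames are skipped, following the usual CTC collapsing rule.
    """
    segments = []
    frame = 0
    for token, run in groupby(alignment):
        length = len(list(run))
        if token != blank:
            segments.append((token, frame, frame + length))
        frame += length
    return segments


def mean_pool(encoder_outputs, segments):
    """Average the encoder frames inside each span (stand-in extractor).

    Each span is processed independently of the others, which is what
    allows all token-level embeddings to be computed in parallel.
    """
    embeddings = []
    for _, start, end in segments:
        frames = encoder_outputs[start:end]
        dim = len(frames[0])
        embeddings.append(
            [sum(f[d] for f in frames) / len(frames) for d in range(dim)]
        )
    return embeddings


# Toy example: 9 encoder frames of dimension 2, three aligned tokens.
alignment = [0, 3, 3, 0, 0, 5, 0, 5, 5]
encoder_outputs = [[float(t), 1.0] for t in range(len(alignment))]

segments = alignment_to_segments(alignment)
token_embeddings = mean_pool(encoder_outputs, segments)
print(segments)
print(token_embeddings)
```

Because no span depends on a previously generated token, the decoder input for every position is available at once, which is the property that permits single-step parallel decoding.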
Related papers
- An Improved Single Step Non-autoregressive Transformer For Automatic Speech Recognition (2021)
- CASS-NAT: CTC Alignment-based Single Step Non-autoregressive Transformer For Speech Recognition (2020)
- Unienc-cassnat: An Encoder-only Non-autoregressive ASR For Speech SSL Models (2024)
- Improved Mask-ctc For Non-autoregressive End-to-end ASR (2020)
- Improving Non-autoregressive End-to-end Speech Recognition With Pre-trained Acoustic And Language Models (2022)
- Transformer-based Online Ctc/attention End-to-end Speech Recognition Architecture (2020)
- Non-autoregressive Transformer ASR With Ctc-enhanced Decoder Input (2020)
- TSNAT: Two-step Non-autoregressvie Transformer Models For Speech Recognition (2021)