An Improved Single Step Non-autoregressive Transformer For Automatic Speech Recognition
2021 · Ruchao Fan, Wei Chu, Peng Chang, et al.
Abstract
Non-autoregressive mechanisms can significantly decrease inference time for speech transformers, especially when the single step variant is applied. Previous work on CTC alignment-based single step non-autoregressive transformer (CASS-NAT) has shown a large real time factor (RTF) improvement over autoregressive transformers (AT). In this work, we propose several methods to improve the accuracy of the end-to-end CASS-NAT, followed by performance analyses. First, convolution augmented self-attention blocks are applied to both the encoder and decoder modules. Second, we propose to expand the trigger mask (acoustic boundary) for each token to increase the robustness of CTC alignments. In addition, iterated loss functions are used to enhance the gradient update of low-layer parameters. Without using an external language model, the WERs of the improved CASS-NAT, when using the three methods, are 3.1%/7.2% on Librispeech test clean/other sets and the CER is 5.4% on the Aishell1 test set, achi
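The trigger-mask expansion mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several assumptions: the CTC alignment is a list of frame-level label ids, the blank id is 0, each token's acoustic boundary is taken to be the run of frames carrying its label, and the hypothetical `expand` parameter widens that boundary by a fixed number of frames on both sides. The function names `token_spans` and `trigger_mask` are illustrative only.

```python
from itertools import groupby

BLANK = 0  # assumed blank id; the real value depends on the vocabulary


def token_spans(alignment, blank=BLANK):
    """Return (label, start, end) per token of a frame-level CTC alignment.

    `end` is exclusive. Consecutive identical labels form a single token,
    as in CTC collapsing; blank frames are skipped.
    """
    spans = []
    pos = 0
    for label, run in groupby(alignment):
        length = len(list(run))
        if label != blank:
            spans.append((label, pos, pos + length))
        pos += length
    return spans


def trigger_mask(alignment, expand=0, blank=BLANK):
    """Build a boolean [num_tokens][num_frames] mask.

    Row i is True on the frames token i may attend to. `expand` (a
    hypothetical parameter) widens each boundary by that many frames on
    each side, clipped to the utterance length.
    """
    num_frames = len(alignment)
    mask = []
    for _, start, end in token_spans(alignment, blank):
        lo = max(0, start - expand)
        hi = min(num_frames, end + expand)
        mask.append([lo <= t < hi for t in range(num_frames)])
    return mask


if __name__ == "__main__":
    alignment = [0, 5, 5, 0, 0, 7, 0, 7, 7]
    for row in trigger_mask(alignment, expand=1):
        print("".join("x" if v else "." for v in row))
```

With `expand=0` each token sees only its own frames, so a boundary that the alignment places a frame too early or too late removes relevant acoustics from that token's view. A positive `expand` lets neighbouring rows overlap, which is one plausible reading of how a wider mask could make the decoder less sensitive to alignment errors.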
Related papers
- CASS-NAT: CTC Alignment-based Single Step Non-autoregressive Transformer For Speech Recognition (2020)
- A CTC Alignment-based Non-autoregressive Transformer For End-to-end Automatic Speech Recognition (2023)
- Non-autoregressive Transformer ASR With CTC-enhanced Decoder Input (2020)
- TSNAT: Two-step Non-autoregressvie Transformer Models For Speech Recognition (2021)
- Linguistic-enhanced Transformer With CTC Embedding For Speech Recognition (2022)
- UniEnc-CASSNAT: An Encoder-only Non-autoregressive ASR For Speech SSL Models (2024)
- Improved Mask-CTC For Non-autoregressive End-to-end ASR (2020)
- An Improved Hybrid CTC-attention Model For Speech Recognition (2018)