EffectiveASR: A Single-step Non-autoregressive Mandarin Speech Recognition Architecture With High Accuracy And Inference Speed
2024 · Ziyang Zhuang, Chenfeng Miao, Kun Zou, et al.
Abstract
Non-autoregressive (NAR) automatic speech recognition (ASR) models predict tokens independently and simultaneously, which yields high inference speed. However, NAR models still lag behind autoregressive (AR) models in accuracy. In this paper, we propose EffectiveASR, a single-step NAR ASR architecture with high accuracy and inference speed. It uses an Index Mapping Vector (IMV) based alignment generator to generate alignments during training, and an alignment predictor that learns those alignments for use at inference. It can be trained end-to-end (E2E) with a cross-entropy loss combined with an alignment loss. EffectiveASR achieves results competitive with the leading models on the AISHELL-1 and AISHELL-2 Mandarin benchmarks. Specifically, it achieves character error rates (CER) of 4.26%/4.62% on the AISHELL-1 dev/test sets, outperforming the AR Conformer while delivering about a 30x inference speedup.
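The CER figures quoted in the abstract follow the usual definition of the metric: the character-level edit distance between hypothesis and reference, divided by the number of reference characters. The following is a minimal sketch of that standard definition, not the authors' evaluation script; the function names `edit_distance` and `cer` and the sample sentences are illustrative assumptions.

```python
from typing import Sequence


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two character sequences."""
    # prev[j] holds the distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference character
                curr[j - 1] + 1,     # insertion of a hypothesis character
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(refs: Sequence[str], hyps: Sequence[str]) -> float:
    """Corpus-level CER: total character errors over total reference characters."""
    total_chars = sum(len(r) for r in refs)
    if total_chars == 0:
        raise ValueError("reference text is empty")
    total_errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return total_errors / total_chars


if __name__ == "__main__":
    # Hypothetical example: one substituted character in a six-character sentence.
    print(f"{cer(['今天天气很好'], ['今天天汽很好']):.2%}")
```

Errors are summed over the whole corpus before dividing, so long utterances weigh more than short ones; averaging per-utterance CER would give a different number.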
Related papers
- A Comparative Study On Non-autoregressive Modelings For Speech-to-text Generation (2021)
- Non-autoregressive End-to-end Approaches For Joint Automatic Speech Recognition And Spoken Language Understanding (2023)
- Improving Non-autoregressive End-to-end Speech Recognition With Pre-trained Acoustic And Language Models (2022)
- FireRedASR: Open-source Industrial-grade Mandarin Speech Recognition Models From Encoder-decoder To LLM Integration (2025)
- Non-autoregressive Transformer ASR With CTC-enhanced Decoder Input (2020)
- Improved Conformer-based End-to-end Speech Recognition Using Neural Architecture Search (2021)
- A CTC Alignment-based Non-autoregressive Transformer For End-to-end Automatic Speech Recognition (2023)
- An Improved Single Step Non-autoregressive Transformer For Automatic Speech Recognition (2021)