Paraformer-v2: An Improved Non-autoregressive Transformer For Noise-robust Speech Recognition
2024 · Keyu An, Zerui Li, Zhifu Gao, et al.
Abstract
Attention-based encoder-decoder models, e.g., the Transformer and its variants, generate the output sequence in an autoregressive (AR) manner. Despite their superior performance, AR models are computationally inefficient, as generation requires as many iterations as the output length. In this paper, we propose Paraformer-v2, an improved version of Paraformer, for fast, accurate, and noise-robust non-autoregressive speech recognition. In Paraformer-v2, we use a CTC module to extract the token embeddings, as an alternative to the continuous integrate-and-fire module in Paraformer. Extensive experiments demonstrate that Paraformer-v2 outperforms Paraformer on multiple datasets, especially on the English datasets (over 14% improvement in WER), and is more robust in noisy environments.
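The abstract says only that a CTC module extracts the token embeddings; it does not spell out the procedure. As a rough illustration of how such an extraction could work, the sketch below takes frame-wise CTC posteriors, picks the greedy label per frame, merges consecutive frames that share a label, drops blank segments, and projects each segment's averaged posterior through an embedding table. The function name `ctc_token_embeddings`, the blank index 0, the averaging step, and the embedding table are all assumptions made for this example, not details confirmed by the abstract.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank symbol


def ctc_token_embeddings(posteriors, embedding_table, blank=BLANK):
    """Collapse frame-wise CTC posteriors into one embedding per token.

    posteriors: list of T frames, each a list of V probabilities.
    embedding_table: list of V vectors, each of dimension D.
    Returns (token_ids, embeddings).
    """
    # Greedy frame-level labels: argmax over the vocabulary for each frame.
    labels = [max(range(len(frame)), key=frame.__getitem__) for frame in posteriors]

    token_ids, embeddings = [], []
    start = 0
    # Consecutive frames with the same label form one segment.
    for label, run in groupby(labels):
        length = len(list(run))
        segment = posteriors[start:start + length]
        start += length
        if label == blank:
            continue  # blank segments emit no token
        vocab = len(segment[0])
        dim = len(embedding_table[0])
        # Average the posteriors over the frames of this segment.
        mean_post = [sum(f[v] for f in segment) / length for v in range(vocab)]
        # Project the averaged posterior through the embedding table.
        emb = [
            sum(mean_post[v] * embedding_table[v][d] for v in range(vocab))
            for d in range(dim)
        ]
        token_ids.append(label)
        embeddings.append(emb)
    return token_ids, embeddings
```

Every segment is processed independently of the others, so all token embeddings are available after a single pass over the frames. This contrasts with AR decoding, which needs one iteration per output token.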
Related papers
- Paraformer: Fast And Accurate Parallel Transformer For Non-autoregressive End-to-end Speech Recognition (2022)
- Improving Non-autoregressive End-to-end Speech Recognition With Pre-trained Acoustic And Language Models (2022)
- TSNAT: Two-step Non-autoregressive Transformer Models For Speech Recognition (2021)
- Non-autoregressive Transformer ASR With CTC-enhanced Decoder Input (2020)
- Improved Mask-CTC For Non-autoregressive End-to-end ASR (2020)
- Non-autoregressive Neural Text-to-speech (2019)
- An Improved Single Step Non-autoregressive Transformer For Automatic Speech Recognition (2021)
- Improving Hybrid CTC/Attention End-to-end Speech Recognition With Pretrained Acoustic And Language Model (2021)