A Comparative Study On Non-autoregressive Modelings For Speech-to-text Generation
2021 · Yosuke Higuchi, Nanxin Chen, Yuya Fujita, et al.
Abstract
Non-autoregressive (NAR) models generate multiple outputs in a sequence simultaneously, which significantly speeds up inference at the cost of an accuracy drop compared to autoregressive (AR) baselines. Because they show great potential for real-time applications, a growing number of NAR models have been explored in different fields to close the performance gap with AR models. In this work, we conduct a comparative study of various NAR modeling methods for end-to-end automatic speech recognition (ASR). Experiments are performed in a state-of-the-art setting using ESPnet. The results on various tasks provide findings that help build an understanding of NAR ASR, such as the accuracy-speed trade-off and robustness to long-form utterances. We also show that the techniques can be combined for further improvement and applied to NAR end-to-end speech translation. All implementations are publicly available to encourage further research in NAR speech processing.
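To make the accuracy-speed trade-off concrete, here is a minimal sketch contrasting the two decoding styles. It assumes CTC-style greedy decoding as the NAR example, which is one widely used NAR approach and not necessarily the configuration used in this study. The names `ctc_greedy_decode`, `ar_greedy_decode`, `BLANK`, and `EOS` are hypothetical and do not come from ESPnet.

```python
from typing import Callable, List, Optional, Sequence

BLANK = 0  # hypothetical id of the CTC blank symbol
EOS = -1   # hypothetical end-of-sequence id for the toy AR model


def ctc_greedy_decode(frame_scores: Sequence[Sequence[float]]) -> List[int]:
    """NAR-style decoding: each frame is labelled independently, so the
    per-frame argmax could be computed for all frames in one parallel pass."""
    best = [max(range(len(f)), key=f.__getitem__) for f in frame_scores]
    out: List[int] = []
    prev: Optional[int] = None
    for tok in best:
        # Collapse repeated labels, then drop blanks.
        if tok != prev and tok != BLANK:
            out.append(tok)
        prev = tok
    return out


def ar_greedy_decode(step: Callable[[List[int]], int], max_len: int = 50) -> List[int]:
    """AR-style decoding: each token is conditioned on the tokens emitted so
    far, so the model is called once per output token, in sequence."""
    out: List[int] = []
    for _ in range(max_len):
        tok = step(out)
        if tok == EOS:
            break
        out.append(tok)
    return out


# Toy inputs (made up for illustration): five frames over a 3-symbol vocabulary.
scores = [
    [0.1, 0.8, 0.1],
    [0.1, 0.7, 0.2],
    [0.9, 0.05, 0.05],
    [0.2, 0.1, 0.7],
    [0.1, 0.2, 0.7],
]
nar_hyp = ctc_greedy_decode(scores)

# Toy AR "model" that replays a fixed target and counts how often it is called.
target = [1, 2]
calls = []

def toy_step(prefix: List[int]) -> int:
    calls.append(len(prefix))
    return target[len(prefix)] if len(prefix) < len(target) else EOS

ar_hyp = ar_greedy_decode(toy_step)
```

In this toy setting both decoders return the same two-token hypothesis, but the AR loop calls its model three times (two tokens plus the end-of-sequence step), while the NAR path labels all frames at once. The accuracy side of the trade-off is not modelled here: the sketch only illustrates why dropping the token-by-token dependency allows parallel generation.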
Related papers
- Non-autoregressive End-to-end Approaches For Joint Automatic Speech Recognition And Spoken Language Understanding (2023) 5.84
- EffectiveASR: A Single-step Non-autoregressive Mandarin Speech Recognition Architecture With High Accuracy And Inference Speed (2024) 3.58
- TSNAT: Two-step Non-autoregressvie Transformer Models For Speech Recognition (2021) 10.61
- Orthros: Non-autoregressive End-to-end Speech Translation With Dual-decoder (2020) 7.50
- Improving Non-autoregressive End-to-end Speech Recognition With Pre-trained Acoustic And Language Models (2022) 10.07
- A Comparison Of End-to-end Models For Long-form Speech Recognition (2019) 12.93
- Knowledge Transfer And Distillation From Autoregressive To Non-autoregressive Speech Recognition (2022) 0.00
- A Comparative Study On Neural Architectures And Training Methods For Japanese Speech Recognition (2021) 7.50