Fast End-to-end Speech Recognition Via Non-autoregressive Models And Cross-modal Knowledge Transferring From BERT
2021 · Ye Bai, Jiangyan Yi, Jianhua Tao, et al.
Abstract
Attention-based encoder-decoder (AED) models have achieved promising performance in speech recognition. However, because the decoder predicts text tokens (such as characters or words) in an autoregressive manner, it is difficult for an AED model to predict all tokens in parallel. This makes the inference speed relatively slow. We believe that because the encoder already captures the whole speech utterance, which has the token-level relationship implicitly, we can predict a token without explicitly autoregressive language modeling. When the prediction of a token does not rely on other tokens, the parallel prediction of all tokens in the sequence is realizable. Based on this idea, we propose a non-autoregressive speech recognition model called LASO (Listen Attentively, and Spell Once). The model consists of an encoder, a decoder, and a position dependent summarizer (PDS). The three modules are based on basic attention blocks. The encoder extracts high-level representations from the speec
Related papers
- Listen Attentively, And Spell Once: Whole Sentence Generation Via A Non-autoregressive Architecture For Low-latency Speech Recognition (2020) · 10.07
- An Online Attention-based Model For Speech Recognition (2018) · 9.59
- State-of-the-art Speech Recognition With Sequence-to-sequence Models (2017) · 21.01
- Two-pass End-to-end Speech Recognition (2019) · 13.97
- Hybrid Transducer And Attention Based Encoder-decoder Modeling For Speech-to-text Tasks (2023) · 6.77
- Optimizing Alignment Of Speech And Language Latent Spaces For End-to-end Speech Recognition And Understanding (2021) · 9.03
- Improving Non-autoregressive End-to-end Speech Recognition With Pre-trained Acoustic And Language Models (2022) · 10.07
- Multi-dialect Speech Recognition With A Single Sequence-to-sequence Model (2017) · 13.79