Toward Streaming ASR With Non-autoregressive Insertion-based Model
2020 · Yuya Fujita, Tianzi Wang, Shinji Watanabe, et al.
Abstract
Neural end-to-end (E2E) models have become a promising technique for building practical automatic speech recognition (ASR) systems. One important issue in building such a system is segmenting the audio to handle streaming input or long recordings. After audio segmentation, an ASR model with a small real-time factor (RTF) is preferable because it lowers the latency of the system. Recently, E2E ASR based on non-autoregressive models has become a promising approach since it can decode an \(N\)-length token sequence in fewer than \(N\) iterations. We propose a system that combines audio segmentation with non-autoregressive ASR to achieve high accuracy and low RTF. The insertion-based model is used as the non-autoregressive ASR. In addition, instead of concatenating separate models for segmentation and ASR, we introduce a new architecture that performs audio segmentation and non-autoregressive ASR with a single neural network. Experimental results on Japanese and English datas
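The two quantities the abstract leans on, the real-time factor and the number of decoding iterations, can be made concrete with a small sketch. This is a toy illustration only, not the paper's model: `real_time_factor` and `parallel_insertion_decode` are hypothetical names, and the decoder assumes an oracle that already knows the target sequence and inserts the middle token of every open span in parallel. Under that assumption, an \(N\)-token sequence is completed in roughly \(\log_2 N\) iterations, which is one way "fewer than \(N\) iterations" can arise.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent decoding / duration of the audio that was decoded."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def parallel_insertion_decode(target):
    """Toy oracle decoder (assumption: the target sequence is known in advance).

    Each iteration inserts, in parallel, the middle token of every span
    that is still empty. Returns (decoded tokens, number of iterations).
    """
    n = len(target)
    filled = [False] * n
    # Half-open index ranges that have not been decoded yet.
    spans = [(0, n)] if n > 0 else []
    iterations = 0
    while spans:
        next_spans = []
        for lo, hi in spans:
            mid = (lo + hi) // 2
            filled[mid] = True  # one insertion per open span
            if lo < mid:
                next_spans.append((lo, mid))
            if mid + 1 < hi:
                next_spans.append((mid + 1, hi))
        spans = next_spans
        iterations += 1
    decoded = [tok for tok, done in zip(target, filled) if done]
    return decoded, iterations


if __name__ == "__main__":
    tokens = "a b c d e f g".split()
    decoded, iterations = parallel_insertion_decode(tokens)
    print(decoded, iterations)
    print(real_time_factor(processing_seconds=2.0, audio_seconds=10.0))
```

An RTF below 1 means a segment is decoded faster than it takes to play, so lowering the iteration count per segment translates directly into lower latency after segmentation.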
Related papers
- Cascaded Encoders For Unifying Streaming And Non-streaming ASR (2020)
- Recognizing Long-form Speech Using Streaming End-to-end Models (2019)
- Unified Streaming And Non-streaming Two-pass End-to-end Model For Speech Recognition (2020)
- Streaming Multi-speaker ASR With RNN-T (2020)
- Bridging The Gap Between Streaming And Non-streaming ASR Systems By Distilling Ensembles Of CTC And RNN-T Models (2021)
- Improved Neural Language Model Fusion For Streaming Recurrent Neural Network Transducer (2020)
- A Comparison Of End-to-end Models For Long-form Speech Recognition (2019)
- Separator-transducer-segmenter: Streaming Recognition And Segmentation Of Multi-party Speech (2022)