A Comparison Of End-to-end Models For Long-form Speech Recognition
2019 · Chung-Cheng Chiu, Wei Han, Yu Zhang, et al.
Abstract
End-to-end automatic speech recognition (ASR) models, including both attention-based models and the recurrent neural network transducer (RNN-T), have shown superior performance compared to conventional systems. However, previous studies have focused primarily on short utterances that typically last for just a few seconds or, at most, a few tens of seconds. Whether such architectures are practical on long utterances that last from minutes to hours remains an open question. In this paper, we both investigate and improve the performance of end-to-end models on long-form transcription. We first present an empirical comparison of different end-to-end models on a real-world long-form task and demonstrate that the RNN-T model is much more robust than attention-based systems in this regime. We next explore two improvements to attention-based systems that significantly improve their performance: restricting the attention to be monotonic, and applying a novel decoding algorithm that breaks long ut
Related papers
- Investigating End-to-end ASR Architectures For Long Form Audio Transcription (2023)
- On The Comparison Of Popular End-to-end Models For Large Scale Speech Recognition (2020)
- Recognizing Long-form Speech Using Streaming End-to-end Models (2019)
- Improving RNN Transducer Modeling For End-to-end Speech Recognition (2019)
- Exploring Architectures, Data And Units For Streaming End-to-end Speech Recognition With RNN-transducer (2018)
- Exploring Neural Transducers For End-to-end Speech Recognition (2017)
- A Comparative Study On Neural Architectures And Training Methods For Japanese Speech Recognition (2021)
- A Comparison Of Label-synchronous And Frame-synchronous End-to-end Models For Speech Recognition (2020)