Investigating End-to-end ASR Architectures For Long Form Audio Transcription
2023 · Nithin Rao Koluguri, Samuel Kriman, Georgy Zelenfroind, et al.
Abstract
This paper presents an overview and evaluation of several end-to-end ASR models on long-form audio. We study three categories of Automatic Speech Recognition (ASR) models based on their core architecture: (1) convolutional, (2) convolutional with squeeze-and-excitation, and (3) convolutional models with attention. We selected one ASR model from each category and evaluated word error rate (WER), maximum audio length, and real-time factor for each model on a variety of long-audio benchmarks: Earnings-21 and Earnings-22, CORAAL, and TED-LIUM3. The model from the self-attention category, with local attention and a global token, has the best accuracy compared to the other architectures. We also compared models with CTC and RNNT decoders and showed that CTC-based models are more robust and efficient than RNNT-based models on long-form audio.
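As a rough illustration of two of the metrics named in the abstract, the sketch below computes word error rate as word-level edit distance divided by reference length, and real-time factor as processing time divided by audio duration. These are the common textbook definitions and are assumed here; the abstract does not state the exact scoring or text normalization the authors used, and the function names are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count.

    Assumes whitespace tokenization and no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = processing time / audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds
```

For example, transcribing a 600-second recording in 30 seconds gives `real_time_factor(30.0, 600.0)`, a ratio of 0.05 under this definition. Some reports use the inverse ratio instead, so the convention should be checked before comparing numbers.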
Related papers
- A Comparison Of End-to-end Models For Long-form Speech Recognition (2019)
- Recognizing Long-form Speech Using Streaming End-to-end Models (2019)
- Advanced Long-context End-to-end Speech Recognition Using Context-expanded Transformers (2021)
- Audio-attention Discriminative Language Model For ASR Rescoring (2019)
- Improving Non-autoregressive End-to-end Speech Recognition With Pre-trained Acoustic And Language Models (2022)
- Survey Of End-to-end Multi-speaker Automatic Speech Recognition For Monaural Audio (2025)
- 4D ASR: Joint Modeling Of CTC, Attention, Transducer, And Mask-predict Decoders (2022)
- A Comparative Study On Neural Architectures And Training Methods For Japanese Speech Recognition (2021)