On Comparison Of Encoders For Attention Based End To End Speech Recognition In Standalone And Rescoring Mode
2022 · Raviraj Joshi, Subodh Kumar
Abstract
Streaming automatic speech recognition (ASR) models are the more popular and suitable choice for voice-based applications. However, non-streaming models provide better accuracy because they look at the entire audio context. To bring the benefits of a non-streaming model to streaming applications such as voice search, the non-streaming model is commonly used in a second-pass re-scoring mode: the candidate hypotheses generated by the streaming model are re-scored using the non-streaming model. In this work, we evaluate non-streaming attention-based end-to-end ASR models on the Flipkart voice search task in both standalone and re-scoring modes. These models are based on the Listen-Attend-Spell (LAS) encoder-decoder architecture. We experiment with different encoder variations based on LSTM, Transformer, and Conformer. We compare the latency requirements of these models along with their performance. Overall, we show that the Transformer model offers acceptable WER with the lowest latency requirements. We report a relative W
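The second-pass re-scoring idea described in the abstract can be illustrated with a minimal sketch. It assumes the streaming model returns an n-best list with log-scores and that the two scores are combined by linear interpolation. The interpolation rule, the `weight` parameter, and the names `Hypothesis` and `rescore` are hypothetical and chosen for illustration; the abstract does not specify how the scores are combined.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Hypothesis:
    text: str
    first_pass_score: float  # log-score from the streaming (first-pass) model


def rescore(
    hypotheses: List[Hypothesis],
    second_pass_score: Callable[[str], float],
    weight: float = 0.5,
) -> List[Hypothesis]:
    """Return the hypotheses sorted best-first by an interpolated log-score.

    Assumption: scores are combined linearly; `weight` is the share given
    to the non-streaming (second-pass) model.
    """
    def combined(h: Hypothesis) -> float:
        return (1.0 - weight) * h.first_pass_score + weight * second_pass_score(h.text)

    return sorted(hypotheses, key=combined, reverse=True)


# Toy n-best list and toy second-pass scores (made-up numbers, not from the paper).
n_best = [
    Hypothesis("red shoes for man", -1.0),
    Hypothesis("red shoes for men", -1.2),
]
toy_scores = {"red shoes for man": -3.0, "red shoes for men": -0.5}

best = rescore(n_best, toy_scores.__getitem__)[0]
print(best.text)
```

In this toy case the first pass prefers "red shoes for man", while the interpolated score prefers "red shoes for men". In a real system, `second_pass_score` would be the log-likelihood assigned to each hypothesis by the non-streaming model given the full audio.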
Related papers
- On The Comparison Of Popular End-to-end Models For Large Scale Speech Recognition (2020)
- Cascaded Encoders For Unifying Streaming And Non-streaming ASR (2020)
- Parallel Rescoring With Transformer For Streaming On-device Speech Recognition (2020)
- Attention Based End To End Speech Recognition For Voice Search In Hindi And English (2021)
- Audio-attention Discriminative Language Model For ASR Rescoring (2019)
- Conv-transformer Transducer: Low Latency, Low Frame Rate, Streamable End-to-end Speech Recognition (2020)
- A Comparison Of End-to-end Models For Long-form Speech Recognition (2019)
- A Comparison Of Semi-supervised Learning Techniques For Streaming ASR At Scale (2023)