A Streaming On-device End-to-end Model Surpassing Server-side Conventional Model Quality And Latency
2020 · Tara N. Sainath, Yanzhang He, Bo Li, et al.
Abstract
Thus far, end-to-end (E2E) models have not been shown to outperform state-of-the-art conventional models with respect to both quality, i.e., word error rate (WER), and latency, i.e., the time the hypothesis is finalized after the user stops speaking. In this paper, we develop a first-pass Recurrent Neural Network Transducer (RNN-T) model and a second-pass Listen, Attend, Spell (LAS) rescorer that surpasses a conventional model in both quality and latency. On the quality side, we incorporate a large number of utterances across varied domains to increase acoustic diversity and the vocabulary seen by the model. We also train with accented English speech to make the model more robust to different pronunciations. In addition, given the increased amount of training data, we explore a varied learning rate schedule. On the latency front, we explore using the end-of-sentence decision emitted by the RNN-T model to close the microphone, and also introduce various optimizations to improve the spee
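The two metrics the abstract defines, word error rate and finalization latency, can be illustrated with a minimal sketch. This is not the authors' evaluation code. The function names, the whitespace tokenization, and the millisecond timestamps are all assumptions made for illustration. WER is computed here in the usual way, as word-level edit distance divided by the number of reference words.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Assumption: words are separated by whitespace, with no text
    normalization (casing, punctuation) applied.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def finalization_latency_ms(end_of_speech_ms: float, finalized_ms: float) -> float:
    """Time between the user stopping speaking and the hypothesis being finalized.

    Both timestamps are hypothetical inputs measured on the same clock.
    """
    return finalized_ms - end_of_speech_ms


if __name__ == "__main__":
    # One deleted word out of four reference words.
    print(word_error_rate("turn on the lights", "turn on lights"))
    print(finalization_latency_ms(end_of_speech_ms=2000.0, finalized_ms=2450.0))
```

Lower is better for both values. The abstract's claim is that the E2E system improves on the conventional model on both at once.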
Related papers
- Two-pass End-to-end Speech Recognition (2019)
- Streaming End-to-end Speech Recognition For Mobile Devices (2018)
- Developing RNN-T Models Surpassing High-performance Hybrid Models With Customization Capability (2020)
- Improving RNN Transducer Modeling For End-to-end Speech Recognition (2019)
- On The Comparison Of Popular End-to-end Models For Large Scale Speech Recognition (2020)
- Exploring Architectures, Data And Units For Streaming End-to-end Speech Recognition With RNN-Transducer (2018)
- Parallel Rescoring With Transformer For Streaming On-device Speech Recognition (2020)
- A Comparison Of End-to-end Models For Long-form Speech Recognition (2019)