Two-pass End-to-end Speech Recognition
2019 · Tara N. Sainath, Ruoming Pang, David Rybach, et al.
Abstract
Many applications of state-of-the-art speech recognition systems require not only low word error rate (WER) but also low latency. Specifically, for many use cases, the system must be able to decode utterances in a streaming fashion and faster than real time. Recently, a streaming recurrent neural network transducer (RNN-T) end-to-end (E2E) model has been shown to be a good candidate for on-device speech recognition, with improved WER and latency metrics compared to conventional on-device models [1]. However, this model still lags behind a large state-of-the-art conventional model in quality [2]. On the other hand, a non-streaming E2E Listen, Attend and Spell (LAS) model has shown comparable quality to large conventional models [3]. This work aims to bring the quality of an E2E streaming model closer to that of a conventional system by incorporating a LAS network as a second-pass component, while still abiding by latency constraints. Our proposed two-pass model achieves
Related papers
- A Streaming On-device End-to-end Model Surpassing Server-side Conventional Model Quality And Latency (2020) 15.00
- Streaming End-to-end Speech Recognition For Mobile Devices (2018) 18.87
- Recognizing Long-form Speech Using Streaming End-to-end Models (2019) 13.74
- Listen Attentively, And Spell Once: Whole Sentence Generation Via A Non-autoregressive Architecture For Low-latency Speech Recognition (2020) 10.07
- Improving RNN Transducer Modeling For End-to-end Speech Recognition (2019) 0.00
- Large-scale Multilingual Speech Recognition With A Streaming End-to-end Model (2019) 14.97
- Unified End-to-end Speech Recognition And Endpointing For Fast And Efficient Speech Systems (2022) 5.24
- Parallel Rescoring With Transformer For Streaming On-device Speech Recognition (2020) 7.50