Recognizing Long-form Speech Using Streaming End-to-end Models
2019 · Arun Narayanan, Rohit Prabhavalkar, Chung-Cheng Chiu, et al.
Abstract
All-neural end-to-end (E2E) automatic speech recognition (ASR) systems that use a single neural network to transduce audio to word sequences have been shown to achieve state-of-the-art results on several tasks. In this work, we examine the ability of E2E models to generalize to unseen domains, where we find that models trained on short utterances fail to generalize to long-form speech. We propose two complementary solutions to address this: training on diverse acoustic data, and LSTM state manipulation to simulate long-form audio when training using short utterances. On a synthesized long-form test set, adding data diversity improves word error rate (WER) by 90% relative, while simulating long-form training improves it by 67% relative, though the combination doesn't improve over data diversity alone. On a real long-form call-center test set, adding data diversity improves WER by 40% relative. Simulating long-form training on top of data diversity improves performance by an additional 2
Related papers
- A Comparison Of End-to-end Models For Long-form Speech Recognition (2019)
- RNN-T Models Fail To Generalize To Out-of-domain Audio: Causes And Solutions (2020)
- Large-scale Multilingual Speech Recognition With A Streaming End-to-end Model (2019)
- Investigating End-to-end ASR Architectures For Long Form Audio Transcription (2023)
- Two-pass End-to-end Speech Recognition (2019)
- Streaming End-to-end Speech Recognition For Mobile Devices (2018)
- On The Comparison Of Popular End-to-end Models For Large Scale Speech Recognition (2020)
- Toward Streaming ASR With Non-autoregressive Insertion-based Model (2020)