Unified End-to-end Speech Recognition And Endpointing For Fast And Efficient Speech Systems
2022 · Shaan Bijwadia, Shuo-Yiin Chang, Bo Li, et al.
Abstract
Automatic speech recognition (ASR) systems typically rely on an external endpointer (EP) model to identify speech boundaries. In this work, we propose a method to jointly train the ASR and EP tasks in a single end-to-end (E2E) multitask model, improving EP quality by optionally leveraging information from the ASR audio encoder. We introduce a "switch" connection, which trains the EP to consume either the audio frames directly or low-level latent representations from the ASR model. This results in a single E2E model that can be used during inference to perform frame filtering at low cost and to make high-quality end-of-query (EOQ) predictions based on ongoing ASR computation. We present results on a voice search test set showing that, compared to separate single-task models, this approach reduces median endpoint latency by 120 ms (30.8% reduction) and 90th percentile latency by 170 ms (23.0% reduction), without regressing word error rate. For continuous recognition, WER improves by
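The "switch" connection described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`low_level_encoder`, `endpointer_head`, `ep_forward`, `sample_switch`, `filter_frames`), the tanh and sigmoid stand-ins, the equal feature size on both paths, and the random per-example choice of path during training are all assumptions made for illustration. The only part taken from the abstract is that the EP consumes either the audio frames directly or low-level latent representations from the ASR model.

```python
import math
import random

def low_level_encoder(frame):
    """Hypothetical stand-in for the lowest ASR encoder layers."""
    return [math.tanh(x) for x in frame]

def endpointer_head(features):
    """Hypothetical EP classifier: maps features to a speech probability."""
    return 1.0 / (1.0 + math.exp(-sum(features)))

def ep_forward(frame, use_latent):
    """The switch: feed the EP raw audio features or encoder latents."""
    features = low_level_encoder(frame) if use_latent else frame
    return endpointer_head(features)

def sample_switch(rng, p_latent=0.5):
    """Assumed training policy: pick a path at random per example."""
    return rng.random() < p_latent

def filter_frames(frames, threshold=0.5):
    """Cheap frame filtering: use the audio path so the encoder never runs."""
    return [f for f in frames if ep_forward(f, use_latent=False) >= threshold]

if __name__ == "__main__":
    rng = random.Random(0)
    frames = [[0.0] * 4, [-2.0] * 4, [1.0] * 4]
    kept = filter_frames(frames)
    print(len(kept), "of", len(frames), "frames kept")
    for frame in kept:
        # During training, each example would go through one sampled path.
        print(ep_forward(frame, use_latent=sample_switch(rng)))
```

In this sketch, frame filtering takes the audio path so that discarded frames never reach the encoder, while an EOQ decision would call `ep_forward(frame, use_latent=True)` to reuse computation the ASR model is already doing.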
Related papers
- E2E-based Multi-task Learning Approach To Joint Speech And Accent Recognition (2021) 0.00
- Unified Streaming And Non-streaming Two-pass End-to-end Model For Speech Recognition (2020) 0.00
- Integrating Pre-trained Speech And Language Models For End-to-end Speech Recognition (2023) 0.00
- Multi-stream End-to-end Speech Recognition (2019) 8.35
- End-to-end Dereverberation, Beamforming, And Speech Recognition With Improved Numerical Stability And Advanced Frontend (2021) 10.97
- Extreme Encoder Output Frame Rate Reduction: Improving Computational Latencies Of Large End-to-end Models (2024) 5.84
- Two-pass End-to-end Speech Recognition (2019) 13.97
- Multi-encoder Multi-resolution Framework For End-to-end Speech Recognition (2018) 0.00