Extreme Encoder Output Frame Rate Reduction: Improving Computational Latencies Of Large End-to-end Models
2024 · Rohit Prabhavalkar, Zhong Meng, Weiran Wang, et al.
Abstract
The accuracy of end-to-end (E2E) automatic speech recognition (ASR) models continues to improve as they are scaled to larger sizes, with some now reaching billions of parameters. Widespread deployment and adoption of these models, however, require computationally efficient strategies for decoding. In the present work, we study one such strategy: applying multiple frame reduction layers in the encoder to compress encoder outputs into a small number of output frames. While similar techniques have been investigated in previous work, we achieve dramatically more reduction than has previously been demonstrated through the use of multiple funnel reduction layers. Through ablations, we study the impact of various architectural choices in the encoder to identify the most effective strategies. We demonstrate that we can generate one encoder output frame for every 2.56 sec of input speech, without significantly affecting word error rate on a large-scale voice search task, while improving encode
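To make the frame-rate arithmetic concrete, here is a minimal sketch of stacked frame reduction. It is not the authors' architecture: plain average pooling stands in for the funnel reduction layers, and the 10 ms input frame length, the stride of 2, and the count of eight layers are assumptions chosen only because 10 ms × 2^8 = 2.56 sec. All function names are hypothetical.

```python
from typing import List

Frame = List[float]


def average_pool(frames: List[Frame], stride: int = 2) -> List[Frame]:
    """Average each group of `stride` consecutive frames into one frame.

    Simplified stand-in for a funnel reduction layer (an assumption,
    not the paper's exact layer).
    """
    pooled = []
    for start in range(0, len(frames), stride):
        group = frames[start:start + stride]
        dim = len(group[0])
        pooled.append([sum(f[d] for f in group) / len(group) for d in range(dim)])
    return pooled


def stack_reductions(frames: List[Frame], num_layers: int, stride: int = 2) -> List[Frame]:
    """Apply `num_layers` pooling layers in sequence."""
    for _ in range(num_layers):
        frames = average_pool(frames, stride)
    return frames


def output_frame_ms(input_frame_ms: int, num_layers: int, stride: int = 2) -> int:
    """Duration of speech covered by one output frame after all reductions."""
    return input_frame_ms * stride ** num_layers


# 256 one-dimensional frames; at an assumed 10 ms per frame this is 2.56 sec.
frames = [[float(i)] for i in range(256)]
reduced = stack_reductions(frames, num_layers=8)

print(len(frames), "->", len(reduced), "frames")
print(output_frame_ms(10, 8), "ms of speech per output frame")
```

Under these assumptions, each stride-2 layer halves the sequence length, so the cost of everything downstream of the encoder scales with a sequence that is 256 times shorter than the input.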
Related papers
- Unified End-to-end Speech Recognition And Endpointing For Fast And Efficient Speech Systems (2022)
- Cascaded Encoders For Unifying Streaming And Non-streaming ASR (2020)
- Transformer-based ASR Incorporating Time-reduction Layer And Fine-tuning With Self-knowledge Distillation (2021)
- Improving RNN Transducer Modeling For End-to-end Speech Recognition (2019)
- Reducing Bias In Production Speech Models (2017)
- EfficientASR: Speech Recognition Network Compression Via Attention Redundancy And Chunk-level FFN Optimization (2024)
- Transformer-based Online Speech Recognition With Decoder-end Adaptive Computation Steps (2020)
- Dynamic Encoder Size Based On Data-driven Layer-wise Pruning For Speech Recognition (2024)