Transformer In Action: A Comparative Study Of Transformer-based Acoustic Models For Large Scale Speech Recognition Applications
2020 · Yongqiang Wang, Yangyang Shi, Frank Zhang, et al.
Abstract
In this paper, we summarize the application of the transformer and its streamable variant, Emformer, as acoustic models for large-scale speech recognition. We compare the transformer-based acoustic models with their LSTM counterparts on industrial-scale tasks. Specifically, we compare Emformer with latency-controlled BLSTM (LCBLSTM) on medium-latency tasks and with LSTM on low-latency tasks. On a low-latency voice assistant task, Emformer achieves 24% to 26% relative word error rate reductions (WERRs). In medium-latency scenarios, compared with an LCBLSTM of similar model size and latency, Emformer achieves significant WERRs across four languages on video captioning datasets, with a 2-3 times reduction in inference real-time factor.
Related papers
- Developing Real-time Streaming Transformer Transducer For Speech Recognition On Large-scale Dataset (2020)
- Streaming Transformer-based Acoustic Models Using Self-attention With Augmented Memory (2020)
- Transformer-based Acoustic Modeling For Hybrid Speech Recognition (2019)
- Conv-transformer Transducer: Low Latency, Low Frame Rate, Streamable End-to-end Speech Recognition (2020)
- Transformer Transducer: One Model Unifying Streaming And Non-streaming Speech Recognition (2020)
- Transformer-transducer: End-to-end Speech Recognition With Self-attention (2019)
- Dynamic Latency For CTC-based Streaming Automatic Speech Recognition With Emformer (2022)
- Transformer Language Models With LSTM-based Cross-utterance Information Representation (2021)