Developing Real-time Streaming Transformer Transducer For Speech Recognition On Large-scale Dataset
2020 · Xie Chen, Yu Wu, Zhenghao Wang, et al.
Abstract
Recently, Transformer-based end-to-end models have achieved great success in many areas, including speech recognition. However, compared to LSTM models, the heavy computational cost of the Transformer during inference is a key obstacle to its deployment. In this work, we explore the potential of Transformer Transducer (T-T) models for first-pass decoding with low latency and fast speed on a large-scale dataset. We combine the idea of Transformer-XL and chunk-wise streaming processing to design a streamable Transformer Transducer model. We demonstrate that T-T outperforms the hybrid model, RNN Transducer (RNN-T), and a streamable Transformer attention-based encoder-decoder model in the streaming scenario. Furthermore, the runtime cost and latency can be optimized with a relatively small look-ahead.
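The chunk-wise streaming idea mentioned in the abstract can be illustrated with an attention mask. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function names and the parameters `chunk_size` and `left_chunks` are hypothetical, and the abstract does not specify how chunks or history are configured. The sketch assumes frames are grouped into fixed-size chunks, that each frame attends to its whole chunk (which is where a bounded look-ahead would come from), and that history is limited to a fixed number of earlier chunks.

```python
def chunk_attention_mask(num_frames, chunk_size, left_chunks):
    """Return mask[i][j] == True when frame i may attend to frame j.

    Assumptions (illustrative, not taken from the paper):
    - frames are split into consecutive chunks of `chunk_size` frames;
    - a frame sees every frame in its own chunk, including later ones;
    - a frame also sees `left_chunks` full chunks of earlier history.
    """
    mask = []
    for i in range(num_frames):
        chunk_idx = i // chunk_size
        # First visible frame: start of the oldest allowed history chunk.
        start = max(0, (chunk_idx - left_chunks) * chunk_size)
        # One past the last visible frame: end of the current chunk.
        end = min(num_frames, (chunk_idx + 1) * chunk_size)
        mask.append([start <= j < end for j in range(num_frames)])
    return mask


def max_look_ahead(mask):
    """Largest number of future frames that any frame is allowed to see."""
    return max(
        max(j for j, visible in enumerate(row) if visible) - i
        for i, row in enumerate(mask)
    )


if __name__ == "__main__":
    demo = chunk_attention_mask(num_frames=6, chunk_size=2, left_chunks=1)
    for row in demo:
        print("".join("x" if visible else "." for visible in row))
    print("max look-ahead (frames):", max_look_ahead(demo))
```

In this sketch the look-ahead never exceeds `chunk_size - 1` frames, so a smaller chunk trades context for lower latency, while `left_chunks` bounds how much history must be kept per step.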
Related papers
- Conv-transformer Transducer: Low Latency, Low Frame Rate, Streamable End-to-end Speech Recognition (2020)
- Transformer Transducer: A Streamable Speech Recognition Model With Transformer Encoders And RNN-T Loss (2020)
- Transformer Transducer: One Model Unifying Streaming And Non-streaming Speech Recognition (2020)
- Transformer-transducer: End-to-end Speech Recognition With Self-attention (2019)
- Parallel Rescoring With Transformer For Streaming On-device Speech Recognition (2020)
- Exploring Architectures, Data And Units For Streaming End-to-end Speech Recognition With RNN-transducer (2018)
- Transformer In Action: A Comparative Study Of Transformer-based Acoustic Models For Large Scale Speech Recognition Applications (2020)
- Improving RNN Transducer Modeling For End-to-end Speech Recognition (2019)