Conv-transformer Transducer: Low Latency, Low Frame Rate, Streamable End-to-end Speech Recognition
2020 · Wenyong Huang, Wenchao Hu, Yu Ting Yeung, et al.
Abstract
The Transformer has achieved competitive performance against state-of-the-art end-to-end models in automatic speech recognition (ASR), and requires significantly less training time than RNN-based models. The original Transformer, with its encoder-decoder architecture, is only suitable for offline ASR: it relies on an attention mechanism to learn alignments, and it encodes input audio bidirectionally. The high computation cost of Transformer decoding also limits its use in production streaming systems. To make the Transformer suitable for streaming ASR, we explore the Transducer framework as a streamable way to learn alignments. For audio encoding, we apply a unidirectional Transformer with interleaved convolution layers. The interleaved convolution layers model future context, which is important to performance. To reduce computation cost, we gradually downsample the acoustic input, also with the interleaved convolution layers. Moreover, we limit the length of history context in self-attention to
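The idea of limiting history context in unidirectional self-attention can be illustrated with an attention mask. The sketch below is not the authors' implementation: the function name `limited_history_mask` and the window size `history=2` are hypothetical, and the abstract above is cut off before stating the length the paper actually uses. It only shows the general pattern, assuming each frame may attend to itself and a fixed number of earlier frames, and to no future frames.

```python
def limited_history_mask(num_frames, history):
    """Build a boolean self-attention mask (hypothetical helper).

    mask[i][j] is True when query frame i may attend to key frame j.
    Frame i sees itself and at most `history` earlier frames; future
    frames (j > i) are always blocked, which keeps the encoder streamable.
    """
    if num_frames < 0 or history < 0:
        raise ValueError("num_frames and history must be non-negative")
    return [
        [i - history <= j <= i for j in range(num_frames)]
        for i in range(num_frames)
    ]


# Illustrative values only; the paper's actual window length is not
# given in the truncated abstract.
mask = limited_history_mask(num_frames=5, history=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

With a mask of this shape, the number of keys each frame attends to is bounded by `history + 1`, so the per-frame attention cost stays constant as the utterance grows instead of increasing with the full history.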
Related papers
- Transformer Transducer: A Streamable Speech Recognition Model With Transformer Encoders And RNN-T Loss (2020) 18.58
- Developing Real-time Streaming Transformer Transducer For Speech Recognition On Large-scale Dataset (2020) 0.00
- Transformer-transducer: End-to-end Speech Recognition With Self-attention (2019) 0.00
- Transformer Transducer: One Model Unifying Streaming And Non-streaming Speech Recognition (2020) 0.00
- Streaming Transformer Transducer Based Speech Recognition Using Non-causal Convolution (2021) 8.82
- Parallel Rescoring With Transformer For Streaming On-device Speech Recognition (2020) 7.50
- Streaming Simultaneous Speech Translation With Augmented Memory Transformer (2020) 6.77
- Blockwise Streaming Transformer For Spoken Language Understanding And Simultaneous Speech Translation (2022) 4.52