Transformer Transducer: One Model Unifying Streaming And Non-streaming Speech Recognition
2020 · Anshuman Tripathi, Jaeyoung Kim, Qian Zhang, et al.
Abstract
In this paper we present a Transformer-Transducer model architecture and a training technique to unify streaming and non-streaming speech recognition models into one model. The model is composed of a stack of transformer layers for audio encoding with no lookahead or right context and an additional stack of transformer layers on top trained with variable right context. At inference time, the context length for the variable-context layers can be changed to trade off the latency and the accuracy of the model. We also show that we can run this model in a Y-model architecture with the top layers running in parallel in low-latency and high-latency modes. This allows us to have streaming speech recognition results with limited latency and delayed speech recognition results with large improvements in accuracy (20% relative improvement for a voice-search task). We show that with limited right context (1-2 seconds of audio) and small additional latency (50-100 milliseconds) at the end of decoding
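The right-context idea in the abstract can be illustrated with attention masks. The following is a minimal sketch, not the authors' implementation: the function names `attention_mask` and `total_lookahead`, the frame counts, and the context sizes are all hypothetical and chosen only for illustration. It shows how a layer with zero right context sees no future frames, how a larger right context widens the window, and how the lookahead of stacked layers adds up.

```python
import numpy as np


def attention_mask(num_frames, left_context, right_context):
    """Boolean mask; entry [i, j] is True when frame i may attend to frame j."""
    idx = np.arange(num_frames)
    offset = idx[None, :] - idx[:, None]  # offset[i, j] = j - i
    return (offset >= -left_context) & (offset <= right_context)


def total_lookahead(right_contexts):
    """Future frames needed by a stack of layers: per-layer right contexts add up."""
    return sum(right_contexts)


# Hypothetical sizes, for illustration only.
streaming = attention_mask(6, left_context=2, right_context=0)  # no future frames
delayed = attention_mask(6, left_context=2, right_context=3)    # 3 future frames

# Two zero-lookahead layers followed by two layers with 3 frames of right context.
lookahead_frames = total_lookahead([0, 0, 3, 3])
```

In this sketch, switching between the low-latency and high-latency modes only changes the `right_context` argument of the upper layers. The lower layers keep a right context of zero, so their outputs could be shared by both modes.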
Related papers
- Transformer Transducer: A Streamable Speech Recognition Model With Transformer Encoders And RNN-T Loss (2020) 18.58
- Conv-Transformer Transducer: Low Latency, Low Frame Rate, Streamable End-to-end Speech Recognition (2020) 11.08
- Developing Real-time Streaming Transformer Transducer For Speech Recognition On Large-scale Dataset (2020) 0.00
- Transformer-Transducer: End-to-end Speech Recognition With Self-attention (2019) 0.00
- Streaming Transformer Transducer Based Speech Recognition Using Non-causal Convolution (2021) 8.82
- Unified Streaming And Non-streaming Two-pass End-to-end Model For Speech Recognition (2020) 0.00
- Streaming Joint Speech Recognition And Disfluency Detection (2022) 0.00
- Transducer-Llama: Integrating LLMs Into Streamable Transducer-based Speech Recognition (2024) 3.58