WhisperRT -- Turning Whisper into a Causal Streaming Model
2025 · Tomer Krichli, Bhiksha Raj, Joseph Keshet
Abstract
Automatic Speech Recognition (ASR) has seen remarkable progress, with models like OpenAI Whisper and NVIDIA Canary achieving state-of-the-art (SOTA) performance in offline transcription. However, these models are not designed for streaming (online or real-time) transcription, due to limitations in their architecture and training methodology. We propose a method to turn the transformer encoder-decoder model into a low-latency streaming model. The encoder is made causal to process audio incrementally, while the decoder conditions on partial encoder states to generate tokens aligned with the available temporal context. This requires explicit synchronization between encoded input frames and token emissions. Since tokens are produced only after sufficient acoustic evidence is observed, an inherent latency arises, necessitating fine-tuning of the encoder-decoder alignment mechanism. We propose an updated inference mechanism that utilizes the fine-tuned causal encoder and decoder to yield gre