Simul-whisper: Attention-guided Streaming Whisper With Truncation Detection
2024 Β· Haoyu Wang, Guoqiang Hu, Guodong Lin, et al.
Abstract
As a robust and large-scale multilingual speech recognition model, Whisper has demonstrated impressive results in many low-resource and out-of-distribution scenarios. However, its encoder-decoder structure hinders its application to streaming speech recognition. In this paper, we introduce Simul-Whisper, which uses the time alignment embedded in Whisper's cross-attention to guide auto-regressive decoding and achieve chunk-based streaming ASR without any fine-tuning of the pre-trained model. Furthermore, we observe the negative effect of the truncated words at the chunk boundaries on the decoding results and propose an integrate-and-fire-based truncation detection model to address this issue. Experiments on multiple languages and Whisper architectures show that Simul-Whisper achieves an average absolute word error rate degradation of only 1.46% at a chunk size of 1 second, which significantly outperforms the current state-of-the-art baseline.
Authors
(none)
Tags
Stats
Related papers
- Whisperrt -- Turning Whisper Into A Causal Streaming Model (2025)0.00
- Dq-whisper: Joint Distillation And Quantization For Efficient Multilingual Speech Recognition (2023)4.52
- Multilingual Distilwhisper: Efficient Distillation Of Multi-task Speech Models Via Language-specific Experts (2023)8.09
- Adapting Whisper For Code-switching Through Encoding Refining And Language-aware Decoding (2024)0.00
- Whisper-lm: Improving ASR Models With Language Models For Low-resource Languages (2025)3.29
- M2r-whisper: Multi-stage And Multi-scale Retrieval Augmentation For Enhancing Whisper (2024)6.77
- Speaking Clearly: A Simplified Whisper-based Codec For Low-bitrate Speech Coding (2025)3.16
- Extending Whisper With Prompt Tuning To Target-speaker ASR (2023)9.59