Target Speaker ASR With Whisper
2024 · Alexander Polok, Dominik Klement, Matthew Wiesner, et al.
Abstract
We propose a novel approach to enable the use of large, single-speaker ASR models, such as Whisper, for target-speaker ASR. The key insight of this method is that it is much easier to model relative differences among speakers by learning to condition on frame-level diarization outputs than to learn the space of all speaker embeddings. We find that adding even a single bias term per diarization output type before the first transformer block can transform single-speaker ASR models into target-speaker ASR models. Our approach also supports speaker-attributed ASR by sequentially generating transcripts for each speaker in a diarization output. This simplified method outperforms a baseline speech separation and diarization cascade by 12.9% absolute ORC-WER on the NOTSOFAR-1 dataset.
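The conditioning idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation. It assumes four diarization output types per frame (silence, target, non-target, overlap), which the abstract does not enumerate, and it uses toy bias values where a real model would use learned parameters. The names frame_labels and add_type_bias are hypothetical. The loop at the end mirrors the speaker-attributed setting: each speaker in the diarization output takes a turn as the target.

```python
from typing import List, Sequence

# Assumed label set for the "diarization output types" (not given in the abstract).
SILENCE, TARGET, NON_TARGET, OVERLAP = range(4)


def frame_labels(activity: Sequence[Sequence[int]], target: int) -> List[int]:
    """Turn per-speaker activity (speakers x frames, 0/1) into one label per
    frame, expressed relative to the chosen target speaker."""
    labels = []
    for t in range(len(activity[0])):
        target_on = bool(activity[target][t])
        others_on = any(
            activity[s][t] for s in range(len(activity)) if s != target
        )
        if target_on and others_on:
            labels.append(OVERLAP)
        elif target_on:
            labels.append(TARGET)
        elif others_on:
            labels.append(NON_TARGET)
        else:
            labels.append(SILENCE)
    return labels


def add_type_bias(
    frames: Sequence[Sequence[float]],
    labels: Sequence[int],
    biases: Sequence[Sequence[float]],
) -> List[List[float]]:
    """Add the bias vector of each frame's label to that frame's features.
    In the setting described above, the result would feed the first
    transformer block of the ASR encoder."""
    return [
        [x + b for x, b in zip(frame, biases[label])]
        for frame, label in zip(frames, labels)
    ]


# Toy diarization output: 2 speakers, 4 frames.
activity = [
    [1, 1, 0, 0],  # speaker 0
    [0, 1, 1, 0],  # speaker 1
]
frames = [[0.0, 0.0] for _ in range(4)]  # placeholder 2-dim features
# One bias vector per label; toy values standing in for learned parameters.
biases = [[0.0, 0.0], [1.0, 1.0], [-1.0, -1.0], [0.5, 0.5]]

# Speaker-attributed pass: condition on each speaker in turn.
conditioned_per_speaker = {
    speaker: add_type_bias(frames, frame_labels(activity, speaker), biases)
    for speaker in range(len(activity))
}
```

The same audio features yield a different conditioned input for each target speaker, which is what lets a single-speaker decoder be steered toward one speaker at a time.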
Related papers
- DiCoW: Diarization-conditioned Whisper For Target Speaker Automatic Speech Recognition (2024)
- Adapting Diarization-conditioned Whisper For End-to-end Multi-talker Speech Recognition (2025)
- Extending Whisper With Prompt Tuning To Target-speaker ASR (2023)
- Simultaneous Speech Recognition And Speaker Diarization For Monaural Dialogue Recordings With Target-speaker Acoustic Models (2019)
- Speaker Conditioned Acoustic Modeling For Multi-speaker Conversational ASR (2021)
- M2R-Whisper: Multi-stage And Multi-scale Retrieval Augmentation For Enhancing Whisper (2024)
- Multilingual DistilWhisper: Efficient Distillation Of Multi-task Speech Models Via Language-specific Experts (2023)
- Elevating Robust Multi-talker ASR By Decoupling Speaker Separation And Speech Recognition (2025)