A Multitask Training Approach To Enhance Whisper With Contextual Biasing And Open-vocabulary Keyword Spotting
2023 · Yuang Li, Min Zhang, Chang Su, et al.
Abstract
Recognizing rare named entities, such as personal names and technical terms, is challenging for automatic speech recognition (ASR) systems, especially when these entities appear infrequently in the training data. In this paper, we introduce keyword spotting enhanced Whisper (KWS-Whisper), a novel ASR system that builds on the Whisper model and performs open-vocabulary keyword spotting (OV-KWS) on the hidden states of the Whisper encoder to recognize user-defined named entities. The detected entities then serve as prompts for the Whisper decoder. To optimize the model, we propose a multitask training approach that jointly learns the OV-KWS and contextual-ASR tasks. We evaluate our approach on the Chinese Aishell hot word subsets and two internal code-switching test sets and show that it significantly improves entity recall compared to the original Whisper model. Moreover, we demonstrate that the OV-KWS module can serve as a plug-and-play component to enhance ASR error correction methods and frozen Whisper models.
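The flow described in the abstract (spot user-defined entities in encoder features, pass the hits to the decoder as a prompt, then measure entity recall) can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the abstract does not specify the OV-KWS architecture, so cosine similarity between frame vectors and per-keyword vectors stands in for the spotter. The function names `spot_keywords`, `build_prompt`, and `entity_recall`, the 0.8 threshold, and the comma-joined prompt format are all hypothetical.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def spot_keywords(
    frames: Sequence[Sequence[float]],
    keyword_vecs: Dict[str, Sequence[float]],
    threshold: float = 0.8,  # assumed value, not taken from the paper
) -> List[str]:
    """Return the keywords whose best frame-level similarity reaches the threshold.

    `frames` stands in for encoder hidden states; `keyword_vecs` stands in for
    whatever representation the spotter uses for each user-defined entity.
    """
    hits = []
    for word, vec in keyword_vecs.items():
        best = max((cosine(frame, vec) for frame in frames), default=0.0)
        if best >= threshold:
            hits.append(word)
    return hits


def build_prompt(hits: Sequence[str]) -> str:
    """Join detected entities into a decoder prompt (format is an assumption)."""
    return ", ".join(hits)


def entity_recall(
    references: Sequence[str],
    hypotheses: Sequence[str],
    entities: Sequence[str],
) -> float:
    """Fraction of entity occurrences in the references that also appear
    in the corresponding hypotheses (substring match, one count per utterance)."""
    total = 0
    found = 0
    for ref, hyp in zip(references, hypotheses):
        for entity in entities:
            if entity in ref:
                total += 1
                if entity in hyp:
                    found += 1
    return found / total if total else 0.0


# Toy data: two 2-dimensional "frames" and two candidate entities.
frames = [[1.0, 0.0], [0.0, 1.0]]
keyword_vecs = {"Aishell": [0.9, 0.1], "Whisper": [-1.0, -1.0]}

hits = spot_keywords(frames, keyword_vecs)
prompt = build_prompt(hits)
```

In a real system the prompt would be handed to the ASR decoder as context; only the entities that the spotter actually detects are included, which keeps the prompt short even when the user-defined entity list is large.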
Related papers
- M2R-Whisper: Multi-stage And Multi-scale Retrieval Augmentation For Enhancing Whisper (2024)
- Contextual Biasing To Improve Domain-specific Custom Vocabulary Audio Transcription Without Explicit Fine-tuning Of Whisper Model (2024)
- Whisper-LM: Improving ASR Models With Language Models For Low-resource Languages (2025)
- WhisperNER: Unified Open Named Entity And Speech Recognition (2024)
- Multilingual DistilWhisper: Efficient Distillation Of Multi-task Speech Models Via Language-specific Experts (2023)
- DCCRN-KWS: An Audio Bias Based Model For Noise Robust Small-footprint Keyword Spotting (2023)
- Probing The Hidden Talent Of ASR Foundation Models For L2 English Oral Assessment (2025)
- Exploring Sequence-to-sequence Transformer-transducer Models For Keyword Spotting (2022)