MM-KWS: Multi-modal Prompts For Multilingual User-defined Keyword Spotting
2024 · Zhiqi Ai, Zhiyong Chen, Shugong Xu
Abstract
In this paper, we propose MM-KWS, a novel approach to user-defined keyword spotting leveraging multi-modal enrollments of text and speech templates. Unlike previous methods that focus solely on either text or speech features, MM-KWS extracts phoneme, text, and speech embeddings from both modalities. These embeddings are then compared with the query speech embedding to detect the target keywords. To ensure the applicability of MM-KWS across diverse languages, we utilize a feature extractor incorporating several multilingual pre-trained models. Subsequently, we validate its effectiveness on Mandarin and English tasks. In addition, we have integrated advanced data augmentation tools for hard case mining to enhance MM-KWS in distinguishing confusable words. Experimental results on the LibriPhrase and WenetPhrase datasets demonstrate that MM-KWS outperforms prior methods significantly.
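The comparison step described in the abstract can be illustrated with a toy sketch. The abstract does not say how the enrollment embeddings are compared with the query speech embedding, so the cosine similarity, the averaging over modalities, the threshold, and every name below (`cosine`, `keyword_score`, `detect`, the toy vectors) are assumptions for illustration only and not the authors' method.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def keyword_score(query, enrollments):
    """Average similarity between the query embedding and each enrollment embedding.

    Assumption: a plain average over modalities; the paper may combine them differently.
    """
    sims = [cosine(query, emb) for emb in enrollments.values()]
    return sum(sims) / len(sims)

def detect(query, enrollments, threshold=0.5):
    """Flag the keyword as present when the averaged score reaches the (hypothetical) threshold."""
    return keyword_score(query, enrollments) >= threshold

# Toy 3-dimensional embeddings standing in for real model outputs.
enrollments = {
    "phoneme": [1.0, 0.0, 0.0],  # from the text enrollment
    "text": [1.0, 0.0, 0.0],     # from the text enrollment
    "speech": [0.0, 1.0, 0.0],   # from the speech template
}
matching_query = [1.0, 1.0, 0.0]
other_query = [0.0, 0.0, 1.0]

print(detect(matching_query, enrollments))  # query overlaps the enrollments
print(detect(other_query, enrollments))     # query is orthogonal to all enrollments
```

In a real system the vectors would come from the multilingual pre-trained feature extractors mentioned above, and the decision rule would be learned rather than a fixed threshold.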
Related papers
- Phoneme-level Contrastive Learning For User-defined Keyword Spotting With Flexible Enrollment (2024) 6.34
- LLM-Synth4KWS: Scalable Automatic Generation And Synthesis Of Confusable Data For Custom Keyword Spotting (2025) 2.26
- PhonMatchNet: Phoneme-guided Zero-shot Keyword Spotting For User-defined Keywords (2023) 13.34
- BBS-KWS: The Mandarin Keyword Spotting System Won The Video Keyword Wakeup Challenge (2021) 0.00
- Query-by-example Keyword Spotting Using Spectral-temporal Graph Attentive Pooling And Multi-task Learning (2024) 0.00
- Contrastive Augmentation: An Unsupervised Learning Approach For Keyword Spotting In Speech Technology (2024) 9.92
- A Multitask Training Approach To Enhance Whisper With Contextual Biasing And Open-vocabulary Keyword Spotting (2023) 0.00
- GE2E-KWS: Generalized End-to-end Training And Evaluation For Zero-shot Keyword Spotting (2024) 2.26