PhonMatchNet: Phoneme-Guided Zero-Shot Keyword Spotting for User-Defined Keywords
2023 · Yong-Hyeok Lee, Namhyun Cho
Abstract
This study presents a novel zero-shot user-defined keyword spotting model that uses the audio-phoneme relationship of the keyword to improve performance. Unlike the previous approach, which makes its estimate at the utterance level only, we use both utterance-level and phoneme-level information. Our proposed method comprises a two-stream speech encoder architecture, a self-attention-based pattern extractor, and a phoneme-level detection loss, and it targets high performance across varied pronunciation conditions. In our experiments, the proposed model outperforms the baseline model and achieves competitive performance compared with full-shot keyword spotting models. It significantly improves the EER and AUC across all datasets, including familiar words, proper nouns, and indistinguishable pronunciations, with average relative improvements of 67% and 80%, respectively. The implementation code of our proposed model is available at https://github.com/ncsoft/PhonMatchNet.
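For readers unfamiliar with the two metrics quoted in the abstract, the following is a minimal sketch of how EER (equal error rate) and AUC (area under the ROC curve) can be computed from per-trial detection scores. It is not the authors' evaluation code; the function names, the toy labels and scores, and the `relative_improvement` convention (relative reduction of an error-like metric) are assumptions made for illustration, and the abstract does not state which formula underlies its 67% and 80% figures.

```python
import numpy as np

def roc_points(labels, scores):
    """Return (fpr, tpr) arrays, sweeping the threshold from high to low."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores, kind="stable")
    labels, scores = labels[order], scores[order]
    # Use one threshold per distinct score (last index of each tied run).
    distinct = np.r_[np.nonzero(np.diff(scores))[0], labels.size - 1]
    tps = np.cumsum(labels)[distinct]      # true positives at each threshold
    fps = (1 + distinct) - tps             # false positives at each threshold
    tpr = np.r_[0.0, tps / labels.sum()]
    fpr = np.r_[0.0, fps / (~labels).sum()]
    return fpr, tpr

def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

def eer(fpr, tpr):
    """Approximate EER: the point where false-accept and false-reject rates are closest."""
    fnr = 1.0 - tpr
    i = int(np.argmin(np.abs(fnr - fpr)))
    return float((fpr[i] + fnr[i]) / 2.0)

def relative_improvement(baseline, proposed):
    """Assumed convention: relative reduction of a lower-is-better metric."""
    return (baseline - proposed) / baseline

# Toy data (made up): 1 = keyword present, 0 = keyword absent.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
fpr, tpr = roc_points(labels, scores)
toy_auc = auc(fpr, tpr)
toy_eer = eer(fpr, tpr)
```

A lower EER and a higher AUC both indicate better separation between matching and non-matching audio-keyword pairs; library routines such as scikit-learn's `roc_curve` can replace `roc_points` when that dependency is available.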
Related papers
- EfficientNet-Absolute Zero for Continuous Speech Keyword Spotting (2020) 0.00
- SLiCK: Exploiting Subsequences for Length-Constrained Keyword Spotting (2024) 5.24
- GE2E-KWS: Generalized End-to-End Training and Evaluation for Zero-Shot Keyword Spotting (2024) 2.26
- Phoneme-Level Contrastive Learning for User-Defined Keyword Spotting with Flexible Enrollment (2024) 6.34
- Small-Footprint Open-Vocabulary Keyword Spotting with Quantized LSTM Networks (2020) 0.00
- Streaming Small-Footprint Keyword Spotting Using Sequence-to-Sequence Models (2017) 12.40
- MM-KWS: Multi-Modal Prompts for Multilingual User-Defined Keyword Spotting (2024) 7.81
- Predicting Detection Filters for Small Footprint Open-Vocabulary Keyword Spotting (2019) 9.92