Phoneme-level Contrastive Learning For User-defined Keyword Spotting With Flexible Enrollment
2024 · Li Kewei, Zhou Hengshun, Shen Kai, et al.
Abstract
User-defined keyword spotting (KWS) enhances the user experience by allowing individuals to customize keywords. However, in open-vocabulary scenarios, most existing methods commonly suffer from high false alarm rates with confusable words and are limited to either audio-only or text-only enrollment. Therefore, in this paper, we first explore the model's robustness against confusable words. Specifically, we propose Phoneme-Level Contrastive Learning (PLCL), which refines and aligns query and source feature representations at the phoneme level. This method enhances the model's disambiguation capability through fine-grained positive and negative comparisons for more accurate alignment, and it is generalizable to jointly optimize both audio-text and audio-audio matching, adapting to various enrollment modes. Furthermore, we maintain a context-agnostic phoneme memory bank to construct confusable negatives for data augmentation. Based on this, a third-category discriminator is specifically d
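To make the idea of phoneme-level positive and negative comparisons concrete, here is a minimal sketch of a generic InfoNCE-style contrastive loss computed over per-phoneme embeddings. It is not the paper's PLCL loss. The function names, the temperature value, and the assumption that query and source embeddings are already aligned one-to-one by phoneme position are all hypothetical choices made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def phoneme_info_nce(query, source, temperature=0.1):
    """Generic InfoNCE loss over phoneme-level embeddings (illustrative only).

    query[i] and source[i] are assumed to embed the same phoneme (positive
    pair); every source[j] with j != i acts as a negative for query[i].
    """
    if not query or len(query) != len(source):
        raise ValueError("query and source must be non-empty and equal in length")
    total = 0.0
    for i, q in enumerate(query):
        logits = [cosine(q, s) / temperature for s in source]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        # Cross-entropy with the matching phoneme as the target class.
        total += log_denom - logits[i]
    return total / len(query)
```

With this formulation the loss is small when each query phoneme is most similar to its own source phoneme and large when it is closer to a different one, which is the behaviour a disambiguation objective needs. In the hypothetical audio-text case, `query` would hold audio-derived phoneme embeddings and `source` text-derived ones. In the audio-audio case both would come from audio.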
Related papers
- Contrastive Learning With Audio Discrimination For Customizable Keyword Spotting In Continuous Speech (2024) 0.00
- LLM-Synth4KWS: Scalable Automatic Generation And Synthesis Of Confusable Data For Custom Keyword Spotting (2025) 2.26
- Contrastive Augmentation: An Unsupervised Learning Approach For Keyword Spotting In Speech Technology (2024) 9.92
- MM-KWS: Multi-modal Prompts For Multilingual User-defined Keyword Spotting (2024) 7.81
- Exploring Representation Learning For Small-footprint Keyword Spotting (2023) 3.58
- Sequence Discriminative Training For Deep Learning Based Acoustic Keyword Spotting (2018) 8.35
- DCCRN-KWS: An Audio Bias Based Model For Noise Robust Small-footprint Keyword Spotting (2023) 5.24
- PhonMatchNet: Phoneme-guided Zero-shot Keyword Spotting For User-defined Keywords (2023) 13.34