Contrastive Learning With Audio Discrimination For Customizable Keyword Spotting In Continuous Speech
2024 · Yu Xi, Baochen Yang, Hao Li, et al.
Abstract
Customizable keyword spotting (KWS) in continuous speech has attracted increasing attention due to its real-world application potential. While contrastive learning (CL) has been widely used to extract keyword representations, previous CL approaches all operate on pre-segmented isolated words and employ only an audio-text representation matching strategy. However, for KWS in continuous speech, co-articulation and streaming word segmentation can easily yield similar audio patterns for different texts, which may in turn trigger false alarms. To address this issue, we propose a novel CL with Audio Discrimination (CLAD) approach for learning keyword representations with both audio-text matching and audio-audio discrimination ability. Here, an InfoNCE loss that considers both audio-audio and audio-text CL data pairs is applied to each sliding window during training. Evaluations on the open-source LibriPhrase dataset show that the use of sliding-window level InfoNCE loss yields comparable per
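To make the loss mentioned in the abstract concrete, here is a minimal sketch of an InfoNCE loss evaluated for one sliding window against both text and audio candidates. It is an illustration under stated assumptions, not the paper's implementation: the function names (`cosine`, `info_nce`, `window_loss`), the use of cosine similarity, the temperature value, and the equal-weight sum of the two terms are all hypothetical choices, since the abstract does not specify them.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE: -log( exp(s_pos / t) / sum over all candidates of exp(s / t) )."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - logits[0]


def window_loss(window_audio, keyword_text, other_texts,
                keyword_audio, confusable_audio, temperature=0.1):
    """Loss for one sliding window (assumed form: unweighted sum of two terms)."""
    # Audio-text matching: pull the window toward its keyword's text embedding.
    audio_text = info_nce(window_audio, keyword_text, other_texts, temperature)
    # Audio-audio discrimination: push the window away from similar-sounding audio.
    audio_audio = info_nce(window_audio, keyword_audio, confusable_audio, temperature)
    return audio_text + audio_audio
```

In this sketch, a window whose embedding is equally similar to the correct keyword audio and to a confusable audio segment incurs a loss of log 2 on the audio-audio term, which is the kind of confusion an audio discrimination term is meant to penalize.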
Related papers
- Phoneme-level Contrastive Learning For User-defined Keyword Spotting With Flexible Enrollment (2024) 6.34
- Sequence Discriminative Training For Deep Learning Based Acoustic Keyword Spotting (2018) 8.35
- Contrastive Augmentation: An Unsupervised Learning Approach For Keyword Spotting In Speech Technology (2024) 9.92
- LLM-Synth4KWS: Scalable Automatic Generation And Synthesis Of Confusable Data For Custom Keyword Spotting (2025) 2.26
- DCCRN-KWS: An Audio Bias Based Model For Noise Robust Small-footprint Keyword Spotting (2023) 5.24
- Exploring Representation Learning For Small-footprint Keyword Spotting (2023) 3.58
- SLiCK: Exploiting Subsequences For Length-constrained Keyword Spotting (2024) 5.24
- Streaming Keyword Spotting Boosted By Cross-layer Discrimination Consistency (2024) 6.34