Learning Efficient Representations For Keyword Spotting With Triplet Loss
2021 · Roman Vygon, Nikolay Mikhaylovskiy
Abstract
In the past few years, triplet loss-based metric embeddings have become a de facto standard for several important computer vision problems, most notably person re-identification. In speech recognition, on the other hand, the metric embeddings generated by the triplet loss are rarely used, even for classification problems. We fill this gap by showing that a combination of two representation learning techniques, a triplet loss-based embedding and a variant of kNN for classification instead of cross-entropy loss, significantly (by 26% to 38%) improves the classification accuracy of convolutional networks on the LibriSpeech-derived LibriWords datasets. To do so, we propose a novel phonetic similarity-based triplet mining approach. We also improve the current best published SOTA for the Google Speech Commands dataset V1 10+2-class classification by about 34%, achieving 98.55% accuracy, V2 10+2-class classification by about 20%, achieving 98.37% accuracy, and V2 35-class classification by
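The two ingredients named in the abstract, a triplet loss on embeddings and kNN classification in the embedding space, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the function names, the Euclidean distance, the margin of 0.5, the plain majority-vote kNN, and the toy 2-D "embeddings" are hypothetical and are not taken from the paper, whose network, kNN variant, and phonetic triplet mining are not described in this abstract.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=0.5):
    """Hinge-style triplet loss: max(0, d(a, p) - d(a, n) + margin).

    The loss is zero once the negative is at least `margin` farther
    from the anchor than the positive is.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def knn_classify(query, embeddings, labels, k=3):
    """Majority vote among the k stored embeddings closest to `query`."""
    order = sorted(range(len(embeddings)), key=lambda i: euclidean(query, embeddings[i]))
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Toy 2-D "embeddings" standing in for the output of a trained network.
train_embeddings = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
                    (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
train_labels = ["yes", "yes", "yes", "no", "no", "no"]

# A well-separated triplet incurs no loss; a violated one is penalised.
easy = triplet_loss((0.0, 0.0), (0.1, 0.2), (1.0, 1.0))
hard = triplet_loss((0.0, 0.0), (1.0, 1.0), (0.1, 0.2))

# Classification replaces a softmax layer with a neighbour lookup.
prediction = knn_classify((0.15, 0.1), train_embeddings, train_labels)
```

In this sketch, training would minimise `triplet_loss` over many mined triplets, and inference would call `knn_classify` on the embedding of a new utterance; how triplets are mined is where the paper's phonetic similarity-based approach would come in.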
Related papers
- Scenario Aware Speech Recognition: Advancements For Apollo Fearless Steps & Chime-4 Corpora (2021) 5.84
- Triplet Entropy Loss: Improving The Generalisation Of Short Speech Language Identification Systems (2020) 0.00
- Triplet Based Embedding Distance And Similarity Learning For Text-independent Speaker Verification (2019) 5.24
- Learning Acoustic Word Embeddings With Phonetically Associated Triplet Network (2018) 0.00
- Towards Learning A Universal Non-semantic Representation Of Speech (2020) 14.43
- Triplet Network With Attention For Speaker Diarization (2018) 7.16
- Triplet Loss Based Embeddings For Forensic Speaker Identification In Spanish (2021) 2.26
- End-to-end Triplet Loss Based Emotion Embedding System For Speech Emotion Recognition (2020) 10.35