Training Speaker Recognition Systems With Limited Data
2022 · Nik Vaessen, David A. van Leeuwen
Abstract
This work considers training neural networks for speaker recognition with far less data than contemporary work uses. We artificially restrict the amount of data by proposing three subsets of the popular VoxCeleb2 dataset. Each subset is restricted to 50k audio files (versus over 1M files available), and the subsets differ in the number of speakers and in session variability. We train three speaker recognition systems on these subsets: the X-vector, ECAPA-TDNN, and wav2vec2 network architectures. We show that the self-supervised, pre-trained weights of wav2vec2 substantially improve performance when training data is limited. Code and data subsets are available at https://github.com/nikvaessen/w2v2-speaker-few-samples.
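To make the idea of a fixed file budget that trades off speaker count against session variability concrete, here is a minimal sketch in Python. It is not the authors' selection procedure (see the linked repository for the actual subsets). The function name `select_subset`, the `(speaker_id, session_id, file_path)` record layout, and the toy counts are all assumptions chosen for illustration.

```python
import random
from collections import defaultdict


def select_subset(records, num_speakers, files_per_speaker, max_sessions=None, seed=0):
    """Pick num_speakers * files_per_speaker files from records.

    records: iterable of (speaker_id, session_id, file_path) tuples
             (hypothetical layout, assumed for this sketch).
    max_sessions: if set, each speaker's files come from at most this
                  many sessions, which lowers session variability.
    """
    rng = random.Random(seed)

    # Group file paths by speaker, then by session.
    by_speaker = defaultdict(lambda: defaultdict(list))
    for speaker, session, path in records:
        by_speaker[speaker][session].append(path)

    # Keep only speakers that can supply enough files under the session limit.
    eligible = {}
    for speaker, sessions in by_speaker.items():
        session_ids = sorted(sessions, key=lambda s: (-len(sessions[s]), s))
        if max_sessions is not None:
            session_ids = session_ids[:max_sessions]
        pool = sorted(p for s in session_ids for p in sessions[s])
        if len(pool) >= files_per_speaker:
            eligible[speaker] = pool

    if len(eligible) < num_speakers:
        raise ValueError("not enough speakers with the requested number of files")

    # Sample speakers first, then a fixed number of files per speaker.
    chosen = sorted(rng.sample(sorted(eligible), num_speakers))
    subset = []
    for speaker in chosen:
        for path in rng.sample(eligible[speaker], files_per_speaker):
            subset.append((speaker, path))
    return subset


# Toy corpus: 20 speakers x 5 sessions x 10 files = 1000 files (made-up data).
toy_records = [
    (f"spk{s:02d}", f"sess{v}", f"spk{s:02d}/sess{v}/{i:05d}.wav")
    for s in range(20)
    for v in range(5)
    for i in range(10)
]

# Same budget of 100 files, spent in two different ways.
few_speakers = select_subset(toy_records, num_speakers=5, files_per_speaker=20)
many_speakers = select_subset(
    toy_records, num_speakers=20, files_per_speaker=5, max_sessions=1
)
```

Both calls return the same number of files, so any difference in downstream performance would come from how the budget is distributed and not from the amount of data.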
Related papers
- Fine-tuning Wav2vec2 For Speaker Recognition (2021)
- DNN Based Speaker Recognition On Short Utterances (2016)
- Weakly Supervised Training Of Speaker Identification Models (2018)
- Voxceleb2: Deep Speaker Recognition (2018)
- Length- And Noise-aware Training Techniques For Short-utterance Speaker Recognition (2020)
- Speech2phone: A Novel And Efficient Method For Training Speaker Recognition Models (2020)
- On Scaling Contrastive Representations For Low-resource Speech Recognition (2021)
- Training Speaker Embedding Extractors Using Multi-speaker Audio With Unknown Speaker Boundaries (2022)