Fine-tuning Wav2vec2 For Speaker Recognition
2021 · Nik Vaessen, David A. van Leeuwen
Abstract
This paper explores applying the wav2vec2 framework to speaker recognition instead of speech recognition. We study how effective the pre-trained weights are on the speaker recognition task, and how to pool the wav2vec2 output sequence into a fixed-length speaker embedding. To adapt the framework to speaker recognition, we propose a single-utterance classification variant trained with cross-entropy (CE) or additive angular margin (AAM) softmax loss, and an utterance-pair classification variant trained with binary cross-entropy (BCE) loss. Our best-performing variant, w2v2-aam, achieves a 1.88% equal error rate (EER) on the extended VoxCeleb1 test set, compared to 1.69% EER for an ECAPA-TDNN baseline. Code is available at https://github.com/nikvaessen/w2v2-speaker.
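The ideas named in the abstract (pooling a frame sequence into a fixed-length embedding, an angular-margin logit, and EER as the evaluation metric) can be illustrated with a small standard-library sketch. This is not the authors' implementation. Mean pooling and cosine scoring are assumed here as one simple choice, the `margin` and `scale` defaults are hypothetical values rather than the paper's settings, and all function names are invented for illustration.

```python
import math


def mean_pool(frames):
    """Average a variable-length list of frame vectors into one fixed-length vector.

    Assumption: mean pooling is only one possible pooling strategy.
    """
    n = len(frames)
    dim = len(frames[0])
    return [sum(frame[d] for frame in frames) / n for d in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two embeddings (assumed trial score)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def aam_logit(cos_theta, is_target, margin=0.2, scale=30.0):
    """Logit for one class under an additive angular margin softmax.

    The margin is added to the angle of the target class only, which makes
    the target harder to fit and pushes classes apart. The defaults are
    hypothetical, and the edge case theta + margin > pi is not handled.
    """
    if not is_target:
        return scale * cos_theta
    theta = math.acos(max(-1.0, min(1.0, cos_theta)))
    return scale * math.cos(theta + margin)


def equal_error_rate(scores, labels):
    """Approximate EER by sweeping a threshold over the observed scores.

    labels: 1 for same-speaker trials, 0 for different-speaker trials.
    Returns the mean of the false-accept and false-reject rates at the
    threshold where the two rates are closest.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    best_gap, eer = float("inf"), None
    for threshold in sorted(set(scores)):
        false_accepts = sum(
            1 for s, y in zip(scores, labels) if s >= threshold and y == 0
        )
        false_rejects = sum(
            1 for s, y in zip(scores, labels) if s < threshold and y == 1
        )
        far = false_accepts / negatives
        frr = false_rejects / positives
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer


# Toy usage: two "utterances" with different numbers of 2-dimensional frames.
embedding_a = mean_pool([[1.0, 2.0], [3.0, 4.0]])
embedding_b = mean_pool([[2.0, 3.0], [2.0, 3.0], [2.0, 3.0]])
trial_score = cosine_similarity(embedding_a, embedding_b)
```

Both toy utterances pool to the same two-dimensional vector even though they contain different numbers of frames, which is the point of pooling: trials can then be scored with a fixed-size comparison regardless of utterance length.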
Related papers
- Exploring Wav2vec 2.0 On Speaker Verification And Language Identification (2020)
- Multitask Detection Of Speaker Changes, Overlapping Speech And Voice Activity Using Wav2vec 2.0 (2022)
- Exploring Wav2vec 2.0 Fine-tuning For Improved Speech Emotion Recognition (2021)
- Training Speaker Recognition Systems With Limited Data (2022)
- ECAPA2: A Hybrid Neural Network Architecture And Training Strategy For Robust Speaker Embeddings (2024)
- Wav2vec: Unsupervised Pre-training For Speech Recognition (2019)
- Vec2wav 2.0: Advancing Voice Conversion Via Discrete Token Vocoders (2024)
- Joint Speaker Features Learning For Audio-visual Multichannel Speech Separation And Recognition (2024)