Self-supervised Training Of Speaker Encoder With Multi-modal Diverse Positive Pairs
2022 · Ruijie Tao, Kong Aik Lee, Rohan Kumar Das, et al.
Abstract
We study a novel neural architecture for a speaker encoder, together with its training strategies, for speaker recognition without using any identity labels. The speaker encoder is trained to extract a fixed-size speaker embedding from a spoken utterance of variable length. Contrastive learning is a typical self-supervised learning technique. However, the quality of the speaker encoder depends heavily on the sampling strategy for positive and negative pairs. A positive pair is commonly sampled as two segments from the same utterance. Unfortunately, such poor-man's positive pairs (PPP) lack the diversity needed to train a robust encoder. In this work, we propose a multi-modal contrastive learning technique with novel sampling strategies. By cross-referencing between speech and face data, we study a method that finds diverse positive pairs (DPP) for contrastive learning, thus improving the robustness of the speaker encoder. We train the speaker encoder on the VoxCeleb2 dataset without any
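The same-utterance sampling that the abstract calls poor-man's positive pairs can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract names contrastive learning but not a specific loss, so the InfoNCE-style loss here is an assumption, and the function names, the toy encoder, and all parameter values are hypothetical. The cross-modal DPP search is not shown, since the abstract gives no details of it.

```python
import numpy as np

def sample_positive_pair(utterance, seg_len, rng):
    """Crop two random fixed-length segments from ONE utterance (a same-utterance positive pair)."""
    starts = rng.integers(0, len(utterance) - seg_len + 1, size=2)
    return tuple(utterance[s:s + seg_len] for s in starts)

def toy_encoder(segment, weights):
    """Hypothetical stand-in for a speaker encoder: frame, project, mean-pool to a fixed-size vector."""
    frame_len = weights.shape[0]
    n_frames = len(segment) // frame_len
    frames = segment[:n_frames * frame_len].reshape(n_frames, frame_len)
    return np.tanh(frames @ weights).mean(axis=0)

def info_nce(anchors, positives, temperature=0.1):
    """Assumed contrastive loss: row i of `anchors` should match row i of `positives`;
    all other rows in the batch act as negatives."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

rng = np.random.default_rng(0)
weights = rng.standard_normal((160, 16))                          # toy projection, not a trained model
utterances = [rng.standard_normal(16000) for _ in range(4)]       # synthetic "audio", illustration only
pairs = [sample_positive_pair(u, 8000, rng) for u in utterances]
anchors = np.stack([toy_encoder(a, weights) for a, _ in pairs])
positives = np.stack([toy_encoder(b, weights) for _, b in pairs])
loss = info_nce(anchors, positives)
```

Because both segments of each pair come from the same recording, they share channel and session conditions, which is the lack of diversity the abstract points to.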
Related papers
- Self-distillation Prototypes Network: Learning Robust Speaker Representations Without Supervision (2023) · 4.52
- Unsupervised Voice-face Representation Learning By Cross-modal Prototype Contrast (2022) · 10.35
- Curriculum Learning For Self-supervised Speaker Verification (2022) · 8.09
- An Iterative Framework For Self-supervised Deep Speaker Representation Learning (2020) · 10.61
- Self-supervised Learning From Contrastive Mixtures For Personalized Speech Enhancement (2020) · 0.00
- Augmentation Adversarial Training For Self-supervised Speaker Recognition (2020) · 0.00
- Self-supervised Speaker Verification With Simple Siamese Network And Self-supervised Regularization (2021) · 10.85
- Experimenting With Additive Margins For Contrastive Self-supervised Speaker Verification (2023) · 4.52