Unsupervised Voice-face Representation Learning By Cross-modal Prototype Contrast
2022 · Boqing Zhu, Kele Xu, Changjian Wang, et al.
Abstract
We present an approach to learning voice-face representations from talking-face videos without any identity labels. Previous works employ cross-modal instance discrimination tasks to establish the correlation between voice and face. These methods neglect the semantic content of different videos, introducing false-negative pairs as training noise. Furthermore, the positive pairs are constructed from the natural correlation between audio clips and visual frames. However, this correlation might be weak or inaccurate in a large amount of real-world data, which introduces deviated positives into the contrastive paradigm. To address these issues, we propose cross-modal prototype contrastive learning (CMPC), which takes advantage of contrastive methods while resisting the adverse effects of false negatives and deviated positives. On one hand, CMPC can learn intra-class invariance by constructing semantic-wise positives via unsupervised clustering in different modalities. On the other hand, b
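The first idea in the abstract, building semantic-wise positives through unsupervised clustering, can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify the clustering algorithm, the loss form, or any hyperparameters, so the use of plain k-means, the softmax-over-prototypes loss, the temperature value, and every function name below (`l2_normalize`, `kmeans`, `cross_modal_prototype_loss`) are assumptions made for illustration. Only one direction (voice against face prototypes) is shown.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def kmeans(x, k, iters=20, seed=0):
    """Plain k-means (hypothetical helper); returns (centroids, assignments)."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        # Squared Euclidean distance from every point to every centroid.
        dists = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)
        for j in range(k):
            members = x[assign == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return centroids, assign


def cross_modal_prototype_loss(voice, face, k=4, temperature=0.1):
    """Contrast each voice embedding against prototypes clustered from faces.

    Assumed setup: row i of `voice` and row i of `face` come from the same clip.
    """
    voice, face = l2_normalize(voice), l2_normalize(face)
    prototypes, assign = kmeans(face, k)
    prototypes = l2_normalize(prototypes)
    logits = voice @ prototypes.T / temperature  # shape (n_samples, k)
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The positive for sample i is the prototype of the cluster its face fell in.
    return -log_prob[np.arange(len(voice)), assign].mean()


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    voice = rng.normal(size=(32, 8))  # stand-in for voice encoder outputs
    face = rng.normal(size=(32, 8))   # stand-in for face encoder outputs
    print(cross_modal_prototype_loss(voice, face))
```

In this sketch, every clip whose face lands in the same cluster shares one positive prototype, so two videos with similar semantic content are no longer pushed apart as negatives, which is the false-negative problem the abstract attributes to instance discrimination.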
Related papers
- Self-supervised Training Of Speaker Encoder With Multi-modal Diverse Positive Pairs (2022) · 8.35
- Momentum Contrast Speaker Representation Learning (2020) · 0.00
- Self-supervised Text-independent Speaker Verification Using Prototypical Momentum Contrastive Learning (2020) · 12.93
- Prototype And Instance Contrastive Learning For Unsupervised Domain Adaptation In Speaker Verification (2024) · 0.00
- Towards Contrastive Learning In Music Video Domain (2023) · 0.00
- Learning Speech Representation From Contrastive Token-acoustic Pretraining (2023) · 7.81
- Cross-modal Speaker Verification And Recognition: A Multilingual Perspective (2020) · 0.00
- Cross-modal Audio-visual Co-learning For Text-independent Speaker Verification (2023) · 9.23