Multi-stage Face-voice Association Learning With Keynote Speaker Diarization
2024 · Ruijie Tao, Zhan Shi, Yidi Jiang, et al.
Abstract
The human brain can associate an unknown person's voice and face by leveraging the general relationship between the two, a task referred to as "cross-modal speaker verification". This task is challenging because of the complex relationship between the modalities. In this paper, we propose a "Multi-stage Face-voice Association Learning with Keynote Speaker Diarization" (MFV-KSD) framework. MFV-KSD contains a keynote speaker diarization front-end that addresses the problem of noisy speech inputs. To balance and enhance intra-modal feature learning and inter-modal correlation understanding, MFV-KSD uses a novel three-stage training strategy. Our experimental results demonstrate robust performance, achieving first rank in the 2024 Face-voice Association in Multilingual Environments (FAME) challenge with an overall Equal Error Rate (EER) of 19.9%. Details can be found at https://github.com/TaoRuijie/MFV-KSD.
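For readers unfamiliar with the metric quoted in the abstract, the following is a minimal sketch of how an Equal Error Rate can be approximated from verification scores. It is not the challenge's scoring script or the authors' code: the function name `equal_error_rate`, the toy scores, and the convention that a higher score means "more likely the same identity" are all assumptions made for illustration.

```python
def equal_error_rate(scores, labels):
    """Approximate the EER by sweeping a decision threshold.

    Assumption: a pair is accepted when score >= threshold, and
    labels are 1 for matching face-voice pairs, 0 otherwise.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one matching and one non-matching pair")

    best_gap, eer = float("inf"), 1.0
    # Candidate thresholds: every observed score, plus one above the maximum
    thresholds = sorted(set(scores)) + [max(scores) + 1.0]
    for t in thresholds:
        # False acceptance rate: non-matching pairs that were accepted
        far = sum(s >= t for s in negatives) / len(negatives)
        # False rejection rate: matching pairs that were rejected
        frr = sum(s < t for s in positives) / len(positives)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the operating point where the two error rates are closest
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Hypothetical toy data, not taken from the paper
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 0, 1, 1, 0, 0, 0]
print(equal_error_rate(toy_scores, toy_labels))  # 0.25
```

On this toy data the false acceptance and false rejection rates meet at 25%, so the function returns 0.25. A lower EER means better verification. If a system outputs distances instead of similarities, the scores would need to be negated before calling this sketch.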
Related papers
- Cross-modal Speaker Verification And Recognition: A Multilingual Perspective (2020)
- Speaker Diarization As A Fully Online Learning Problem In Minivox (2020)
- Multi-input Multi-output Target-speaker Voice Activity Detection For Unified, Flexible, And Robust Audio-visual Speaker Diarization (2024)
- Integrating Audio, Visual, And Semantic Information For Enhanced Multimodal Speaker Diarization (2024)
- Contrastive Learning-based Chaining-cluster For Multilingual Voice-face Association (2024)
- Multi-scale Speaker Diarization With Neural Affinity Score Fusion (2020)
- Multimodal Speaker Segmentation And Diarization Using Lexical And Acoustic Cues Via Sequence To Sequence Neural Networks (2018)
- Seeing Your Speech Style: A Novel Zero-shot Identity-disentanglement Face-based Voice Conversion (2024)