Speaker Embeddings With Weakly Supervised Voice Activity Detection For Efficient Speaker Diarization
2024 Β· Jenthe Thienpondt, Kris Demuynck
Abstract
Current speaker diarization systems rely on an external voice activity detection model prior to speaker embedding extraction on the detected speech segments. In this paper, we establish that the attention system of a speaker embedding extractor acts as a weakly supervised internal VAD model and performs equally or better than comparable supervised VAD systems. Subsequently, speaker diarization can be performed efficiently by extracting the VAD logits and corresponding speaker embedding simultaneously, alleviating the need and computational overhead of an external VAD model. We provide an extensive analysis of the behavior of the frame-level attention system in current speaker verification models and propose a novel speaker diarization pipeline using ECAPA2 speaker embeddings for both VAD and embedding extraction. The proposed strategy gains state-of-the-art performance on the AMI, VoxConverse and DIHARD III diarization benchmarks.
Authors
(none)
Tags
Stats
Related papers
- Joint Training Of Speaker Embedding Extractor, Speech And Overlap Detection For Diarization (2024)2.26
- Leveraging Speaker Embeddings In End-to-end Neural Diarization For Two-speaker Scenarios (2024)0.00
- Look Who's Not Talking (2020)0.00
- Speaker Diarisation Using 2D Self-attentive Combination Of Embeddings (2019)9.92
- Speaker Diarization With LSTM (2017)17.48
- Advancing The Dimensionality Reduction Of Speaker Embeddings For Speaker Diarisation: Disentangling Noise And Informing Speech Activity (2021)2.26
- TS-SEP: Joint Diarization And Separation Conditioned On Estimated Speaker Embeddings (2023)10.35
- Target-speaker Voice Activity Detection: A Novel Approach For Multi-speaker Diarization In A Dinner Party Scenario (2020)16.19