Late Audio-visual Fusion For In-the-wild Speaker Diarization
2022 · Zexu Pan, Gordon Wichern, François G. Germain, et al.
Abstract
Speaker diarization is well studied for constrained audio but little explored for challenging in-the-wild videos, which have more speakers, shorter utterances, and inconsistent on-screen speakers. We address this gap by proposing an audio-visual diarization model which combines audio-only and visual-centric sub-systems via late fusion. For audio, we show that an attractor-based end-to-end system (EEND-EDA) performs remarkably well when trained with our proposed recipe of a simulated proxy dataset, and propose an improved version, EEND-EDA++, that uses attention in decoding and a speaker recognition loss during training to better handle the larger number of speakers. The visual-centric sub-system leverages facial attributes and lip-audio synchrony for identity and speech activity estimation of on-screen speakers. Both sub-systems surpass the state of the art (SOTA) by a large margin, with the fused audio-visual system achieving a new SOTA on the AVA-AVD benchmark.
Related papers
- Audio-visual Speaker Diarization Based On Spatiotemporal Bayesian Fusion (2016)
- Joint Training Or Not: An Exploration Of Pre-trained Speech Models In Audio-visual Speaker Diarization (2023)
- Integrating Audio, Visual, And Semantic Information For Enhanced Multimodal Speaker Diarization (2024)
- Multi-input Multi-output Target-speaker Voice Activity Detection For Unified, Flexible, And Robust Audio-visual Speaker Diarization (2024)
- Data Fusion For Audiovisual Speaker Localization: Extending Dynamic Stream Weights To The Spatial Domain (2021)
- Probabilistic Fusion And Calibration Of Neural Speaker Diarization Models (2025)
- Multi-scale Speaker Diarization With Neural Affinity Score Fusion (2020)
- Spot The Conversation: Speaker Diarisation In The Wild (2020)