Active Speaker Detection As A Multi-objective Optimization With Uncertainty-based Multimodal Fusion
2021 · Baptiste Pouthier, Laurent Pilati, Leela K. Gudupudi, et al.
Abstract
A variety of studies have established that combining video and audio data significantly benefits active speaker detection. However, either modality can mislead audiovisual fusion by introducing unreliable or deceptive information. This paper frames active speaker detection as a multi-objective learning problem that leverages the best of each modality through a novel self-attention, uncertainty-based multimodal fusion scheme. The results show that the proposed multi-objective learning architecture outperforms traditional approaches on both mAP and AUC scores. We further demonstrate that, for active speaker detection, our fusion strategy surpasses other modality fusion methods reported in various disciplines. Finally, we show that the proposed method significantly improves on the state of the art on the AVA-ActiveSpeaker dataset.
Related papers
- Robust Audio-visual Target Speaker Extraction With Emotion-aware Multiple Enrollment Fusion (2025) 0.00
- Enhancing Real-world Active Speaker Detection With Multi-modal Extraction Pre-training (2024) 5.24
- Attention-based Audio-visual Fusion For Robust Automatic Speech Recognition (2018) 16.67
- Audio-visual Speech Separation Based On Joint Feature Representation With Cross-modal Attention (2022) 0.00
- Attentive Fusion Enhanced Audio-visual Encoding For Transformer Based Robust Speech Recognition (2020) 0.00
- Audio-visual Approach For Multimodal Concurrent Speaker Detection (2024) 0.00
- MLCA-AVSR: Multi-layer Cross Attention Fusion Based Audio-visual Speech Recognition (2024) 10.07
- Automatic Quality Assessment For Audio-visual Verification Systems. The Love Submission To NIST SRE Challenge 2019 (2020) 0.00