Leveraging Visual Supervision For Array-based Active Speaker Detection And Localization
2023 · Davide Berghi, Philip J. B. Jackson
Abstract
Conventional audio-visual approaches for active speaker detection (ASD) typically rely on visually pre-extracted face tracks and the corresponding single-channel audio to find the speaker in a video. Therefore, they tend to fail every time the face of the speaker is not visible. We demonstrate that a simple audio convolutional recurrent neural network (CRNN) trained with spatial input features extracted from multichannel audio can perform simultaneous horizontal active speaker detection and localization (ASDL), independently of the visual modality. To address the time and cost of generating ground truth labels to train such a system, we propose a new self-supervised training pipeline that embraces a "student-teacher" learning approach. A conventional pre-trained active speaker detector is adopted as a "teacher" network to provide the position of the speakers as pseudo-labels. The multichannel audio "student" network is trained to generate the same results. At inference, the stude
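The pseudo-labelling step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `make_pseudo_labels`, the `(x_center_px, speaking_score)` teacher output format, the 0.5 threshold, and the rule of keeping only the highest-scoring face per frame are all assumptions made for illustration. The sketch turns per-frame teacher detections into an (activity, normalized horizontal position) target that an audio-only student could be trained to reproduce.

```python
from typing import List, Tuple

# Hypothetical teacher output for one frame: a list of
# (x_center_px, speaking_score) pairs, one per detected face.
Face = Tuple[float, float]


def make_pseudo_labels(
    teacher_frames: List[List[Face]],
    frame_width: float,
    threshold: float = 0.5,  # assumed decision threshold, not from the paper
) -> List[Tuple[int, float]]:
    """Convert teacher detections into (active, position) pseudo-labels.

    `active` is 1 when the best-scoring face reaches the threshold, else 0.
    `position` is the horizontal face centre normalized to [0, 1];
    it is set to 0.0 for inactive frames.
    """
    labels = []
    for faces in teacher_frames:
        if not faces:
            # No visible face: the teacher cannot provide a position.
            labels.append((0, 0.0))
            continue
        # Assumption: keep only the most confident face in each frame.
        x_center, score = max(faces, key=lambda face: face[1])
        if score >= threshold:
            labels.append((1, x_center / frame_width))
        else:
            labels.append((0, 0.0))
    return labels


frames = [[(320.0, 0.9), (100.0, 0.2)], [], [(640.0, 0.4)]]
print(make_pseudo_labels(frames, frame_width=1280))
```

In a full pipeline these targets would supervise the multichannel audio student in place of manually annotated labels; how the paper encodes position and handles several simultaneous speakers is not stated in the text above.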
Related papers
- Audio Inputs For Active Speaker Detection And Localization Via Microphone Array (2023)
- Is Someone Speaking? Exploring Long-term Temporal Features For Audio-visual Active Speaker Detection (2021)
- Audio-visual Approach For Multimodal Concurrent Speaker Detection (2024)
- SALADnet: Self-attentive Multisource Localization In The Ambisonics Domain (2021)
- How To Design A Three-stage Architecture For Audio-visual Active Speaker Detection In The Wild (2021)
- Dual Mean-teacher: An Unbiased Semi-supervised Framework For Audio-visual Source Localization (2024)
- Audio-visual Active Speaker Extraction For Sparsely Overlapped Multi-talker Speech (2023)
- Joint Training Or Not: An Exploration Of Pre-trained Speech Models In Audio-visual Speaker Diarization (2023)