Audio-visual Active Speaker Extraction For Sparsely Overlapped Multi-talker Speech
2023 · Junjie Li, Ruijie Tao, Zexu Pan, et al.
Abstract
Target speaker extraction aims to extract the speech of a specific speaker from a multi-talker mixture as specified by an auxiliary reference. Most studies focus on the scenario where the target speech is highly overlapped with the interfering speech. However, this scenario only accounts for a small percentage of real-world conversations. In this paper, we aim at the sparsely overlapped scenarios in which the auxiliary reference needs to perform two tasks simultaneously: detect the activity of the target speaker and disentangle the active speech from any interfering speech. We propose an audio-visual speaker extraction model named ActiveExtract, which leverages speaking activity from audio-visual active speaker detection (ASD). The ASD directly provides the frame-level activity of the target speaker, while its intermediate feature representation is trained to discriminate speech-lip synchronization that could be used for speaker disentanglement. Experimental results show our model outp
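The abstract states that the ASD module provides frame-level activity for the target speaker. As a rough illustration of what frame-level activity can do in a sparsely overlapped setting, the sketch below gates an already-extracted waveform so that samples falling in frames judged inactive are zeroed. This is an assumption-laden toy, not the ActiveExtract architecture: the abstract does not say how the model uses the activity signal internally, and the function names, the 0.5 threshold, and the hard 0/1 mask are all hypothetical choices made here for clarity.

```python
def frames_to_sample_mask(frame_activity, num_samples, threshold=0.5):
    """Expand per-frame activity scores into a 0/1 mask, one entry per audio sample.

    Hypothetical helper: the threshold and the hard mask are illustrative choices.
    """
    num_frames = len(frame_activity)
    if num_frames == 0:
        raise ValueError("frame_activity must not be empty")
    mask = []
    for n in range(num_samples):
        # Map sample index n to the frame that covers it (integer arithmetic).
        frame = min(n * num_frames // num_samples, num_frames - 1)
        mask.append(1.0 if frame_activity[frame] >= threshold else 0.0)
    return mask


def gate_by_activity(waveform, frame_activity, threshold=0.5):
    """Zero out samples of an extracted waveform where the target is judged inactive."""
    mask = frames_to_sample_mask(frame_activity, len(waveform), threshold)
    return [x * m for x, m in zip(waveform, mask)]


# Toy input: 4 video frames of activity scores, 8 audio samples (2 per frame).
activity = [0.9, 0.1, 0.2, 0.8]
extracted = [0.5, -0.5, 0.3, -0.3, 0.2, -0.2, 0.4, -0.4]
gated = gate_by_activity(extracted, activity)
```

In this toy input the target is active only in the first and last frames, so the middle four samples of `gated` are silenced while the outer four pass through unchanged. A real system would more likely use soft activity scores and smooth transitions between frames instead of a hard gate.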
Related papers
- Is Someone Speaking? Exploring Long-term Temporal Features For Audio-visual Active Speaker Detection (2021) 21.12
- USEV: Universal Speaker Extraction With Visual Cue (2021) 12.17
- Enhancing Real-world Active Speaker Detection With Multi-modal Extraction Pre-training (2024) 5.24
- Selective Listening By Synchronizing Speech With Lips (2021) 11.85
- Target Speaker Extraction For Overlapped Multi-talker Speaker Verification (2019) 0.00
- Audio-visual Target Speaker Enhancement On Multi-talker Environment Using Event-driven Cameras (2019) 8.09
- Imaginenet: Target Speaker Extraction With Intermittent Visual Cue Through Embedding Inpainting (2022) 7.16
- Robust Audio-visual Target Speaker Extraction With Emotion-aware Multiple Enrollment Fusion (2025) 0.00