Is Someone Speaking? Exploring Long-term Temporal Features For Audio-visual Active Speaker Detection
2021 · Ruijie Tao, Zexu Pan, Rohan Kumar Das, et al.
Abstract
Active speaker detection (ASD) seeks to detect who is speaking in a visual scene containing one or more speakers. Successful ASD depends on accurate interpretation of short-term and long-term audio and visual information, as well as audio-visual interaction. Unlike prior work, where systems make decisions instantaneously using short-term features, we propose a novel framework, named TalkNet, that makes decisions by taking both short-term and long-term features into consideration. TalkNet consists of audio and visual temporal encoders for feature representation, an audio-visual cross-attention mechanism for inter-modality interaction, and a self-attention mechanism to capture long-term speaking evidence. Experiments demonstrate that TalkNet achieves improvements of 3.5% and 2.2% over the state-of-the-art systems on the AVA-ActiveSpeaker dataset and the Columbia ASD dataset, respectively. Code has been made available at: https://github.com/TaoRuijie/TalkNet_ASD.
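The data flow described in the abstract (per-modality features, cross-attention between modalities, then self-attention over the whole sequence) can be illustrated with a minimal sketch. This is not the TalkNet implementation: it assumes the temporal encoders have already produced one feature vector per frame, it omits all learned projections and the classifier, and the names `attention`, `fuse`, and the toy inputs are hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Plain scaled dot-product attention over sequences of frame vectors.
    No learned weights: this only illustrates the mechanism."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

def fuse(audio, visual):
    """Assumed fusion order: cross-attention, concatenation, self-attention."""
    # Cross-attention: each modality queries the other one.
    audio_ctx = attention(audio, visual, visual)
    visual_ctx = attention(visual, audio, audio)
    # Concatenate the two attended streams frame by frame.
    joint = [a + v for a, v in zip(audio_ctx, visual_ctx)]
    # Self-attention lets every frame draw on the full (long-term) sequence.
    return attention(joint, joint, joint)

# Toy inputs (hypothetical): 3 frames, 2-dimensional features per modality.
audio_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
visual_feats = [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]]
fused = fuse(audio_feats, visual_feats)
```

Each frame of `fused` would then feed a per-frame speaking/non-speaking classifier; that step, like the encoders, is left out of this sketch.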
Related papers
- TalkNCE: Improving Active Speaker Detection With Talk-aware Contrastive Learning (2023)
- How To Design A Three-stage Architecture For Audio-visual Active Speaker Detection In The Wild (2021)
- Audio-visual Active Speaker Extraction For Sparsely Overlapped Multi-talker Speech (2023)
- Leveraging Visual Supervision For Array-based Active Speaker Detection And Localization (2023)
- Audio-visual Approach For Multimodal Concurrent Speaker Detection (2024)
- Enhancing Real-world Active Speaker Detection With Multi-modal Extraction Pre-training (2024)
- Audio Inputs For Active Speaker Detection And Localization Via Microphone Array (2023)
- Joint Training Or Not: An Exploration Of Pre-trained Speech Models In Audio-visual Speaker Diarization (2023)