TalkNCE: Improving Active Speaker Detection With Talk-aware Contrastive Learning
2023 · Chaeyoung Jung, Suyeon Lee, Kihyun Nam, et al.
Abstract
This work addresses Active Speaker Detection (ASD), the task of determining whether a person is speaking in a series of video frames. Previous work has approached the task mainly by exploring network architectures, while learning effective representations has received less attention. In this work, we propose TalkNCE, a novel talk-aware contrastive loss. The loss is applied only to the parts of each segment in which a person on screen is actually speaking. This encourages the model to learn effective representations through the natural correspondence between speech and facial movements. Our loss can be jointly optimized with the existing objectives for training ASD models, without the need for additional supervision or training data. The experiments demonstrate that our loss can be easily integrated into existing ASD frameworks, improving their performance. Our method achieves state-of-the-art performance on the AVA-ActiveSpeaker and ASW datasets.
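As a rough illustration of the idea in the abstract, the sketch below computes an InfoNCE-style loss over only the frames labelled as speaking. It is a minimal sketch under stated assumptions, not the authors' formulation: the function name `talk_aware_nce`, the cosine similarity, the visual-to-audio direction, and the default temperature of 0.1 are all hypothetical choices made for this example.

```python
import math


def _normalize(vec):
    """Scale a vector to unit L2 norm (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def talk_aware_nce(audio, visual, speaking, temperature=0.1):
    """InfoNCE-style loss restricted to speaking frames (illustrative only).

    audio, visual: per-frame embeddings, one list of floats per frame.
    speaking:      per-frame labels, truthy where the person is speaking.
    temperature:   softmax temperature (assumed value, not from the paper).
    """
    # Keep only the frames where the on-screen person is actually speaking.
    active = [t for t, is_speaking in enumerate(speaking) if is_speaking]
    if len(active) < 2:
        # A contrastive term needs at least one negative; contribute nothing.
        return 0.0

    a = [_normalize(audio[t]) for t in active]
    v = [_normalize(visual[t]) for t in active]

    total = 0.0
    for i, v_i in enumerate(v):
        # Similarity of this visual frame to every active audio frame.
        logits = [
            sum(x * y for x, y in zip(v_i, a_j)) / temperature for a_j in a
        ]
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # The positive pair is the audio frame at the same time step.
        total += log_denom - logits[i]
    return total / len(active)
```

Because the abstract says the loss is optimized jointly with existing ASD objectives, one plausible use is to add the returned value, scaled by a weighting factor, to a framework's usual classification loss; the weighting factor is likewise an assumption of this sketch.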
Related papers
- Is Someone Speaking? Exploring Long-term Temporal Features For Audio-visual Active Speaker Detection (2021) · 21.12
- Leveraging Visual Supervision For Array-based Active Speaker Detection And Localization (2023) · 6.77
- Audio-visual Active Speaker Extraction For Sparsely Overlapped Multi-talker Speech (2023) · 7.50
- Unpaired Speech Enhancement By Acoustic And Adversarial Supervision For Speech Recognition (2018) · 10.21
- How To Design A Three-stage Architecture For Audio-visual Active Speaker Detection In The Wild (2021) · 12.10
- Speaker Representation Learning Via Contrastive Loss With Maximal Speaker Separability (2022) · 10.68
- ACA-Net: Towards Lightweight Speaker Verification Using Asymmetric Cross Attention (2023) · 0.00
- Additive Margin In Contrastive Self-supervised Frameworks To Learn Discriminative Speaker Representations (2024) · 2.26