Dual Mean-teacher: An Unbiased Semi-supervised Framework For Audio-visual Source Localization
2024 · Yuxin Guo, Shijie Ma, Hu Su, et al.
Abstract
Audio-Visual Source Localization (AVSL) aims to locate sounding objects within video frames given the paired audio clips. Existing methods predominantly rely on self-supervised contrastive learning of audio-visual correspondence. Without any bounding-box annotations, they struggle to achieve precise localization, especially for small objects, and suffer from blurry boundaries and false positives. Moreover, naive semi-supervised methods fail to fully leverage the information in abundant unlabeled data. In this paper, we propose a novel semi-supervised learning framework for AVSL, namely Dual Mean-Teacher (DMT), comprising two teacher-student structures to circumvent the confirmation bias issue. Specifically, two teachers, pre-trained on limited labeled data, are employed to filter out noisy samples via the consensus between their predictions, and then generate high-quality pseudo-labels by intersecting their confidence maps. The sufficient utilization of both labeled and unlabel
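The filtering and pseudo-labeling steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the use of IoU as the consensus measure, both threshold values, and the plain-list map representation are assumptions made for illustration. The `ema_update` helper shows the exponential-moving-average weight update commonly used in mean-teacher training; the decay value is also an assumption.

```python
from typing import List, Optional, Tuple

Map = List[List[float]]   # per-pixel confidence in [0, 1]
Mask = List[List[bool]]   # binary localization mask


def binarize(conf: Map, thresh: float) -> Mask:
    """Turn a confidence map into a binary mask."""
    return [[v >= thresh for v in row] for row in conf]


def consensus_pseudo_label(
    conf_a: Map,
    conf_b: Map,
    mask_thresh: float = 0.5,   # assumed value
    agree_thresh: float = 0.5,  # assumed value
) -> Tuple[bool, Optional[Mask]]:
    """Keep a sample only if the two teachers agree (IoU is an assumed
    consensus measure); the pseudo-label is the intersection of their masks."""
    mask_a = binarize(conf_a, mask_thresh)
    mask_b = binarize(conf_b, mask_thresh)
    inter = [[a and b for a, b in zip(ra, rb)] for ra, rb in zip(mask_a, mask_b)]
    inter_area = sum(sum(row) for row in inter)
    union_area = sum(
        a or b for ra, rb in zip(mask_a, mask_b) for a, b in zip(ra, rb)
    )
    iou = inter_area / union_area if union_area else 0.0
    if iou < agree_thresh:
        return False, None      # teachers disagree: treat the sample as noisy
    return True, inter          # teachers agree: intersect their predictions


def ema_update(teacher: List[float], student: List[float],
               decay: float = 0.99) -> List[float]:
    """Mean-teacher style update: teacher weights track an EMA of the student."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]
```

In this sketch a sample whose two teacher masks overlap poorly is dropped rather than given a pseudo-label, which is one simple way to keep a single teacher's mistakes from reinforcing themselves.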
Related papers
- Modality-independent Teachers Meet Weakly-supervised Audio-visual Event Parser (2023) · 4.77
- Leveraging Visual Supervision For Array-based Active Speaker Detection And Localization (2023) · 6.77
- Self-supervised Audio-visual Speech Representations Learning By Multimodal Self-distillation (2022) · 0.00
- Learning To Unify Audio, Visual And Text For Audio-enhanced Multilingual Visual Answer Localization (2024) · 2.26
- Audio Visual Segmentation Through Text Embeddings (2025) · 1.81
- AV-data2vec: Self-supervised Learning Of Audio-visual Speech Representations With Contextualized Target Representations (2023) · 0.00
- Positive And Negative Sampling Strategies For Self-supervised Learning On Audio-video Data (2024) · 0.00
- Leveraging Unimodal Self-supervised Learning For Multimodal Audio-visual Speech Recognition (2022) · 11.29