Modality-independent Teachers Meet Weakly-supervised Audio-visual Event Parser
2023 · Yung-Hsuan Lai, Yen-Chun Chen, Yu-Chiang Frank Wang
Abstract
Audio-visual learning has been a major pillar of multi-modal machine learning, where the community has mostly focused on the modality-aligned setting, i.e., the audio and visual modalities are both assumed to signal the prediction target. With the Look, Listen, and Parse dataset (LLP), we investigate the under-explored unaligned setting, where the goal is to recognize audio and visual events in a video with only weak labels observed. Such weak video-level labels only tell which events happen, without indicating the modality in which they are perceived (audio, visual, or both). To enhance learning in this challenging setting, we incorporate large-scale contrastively pre-trained models as the modality teachers. A simple, effective, and generic method, termed Visual-Audio Label Elaboration (VALOR), is introduced to harvest modality labels for the training events. Empirical studies show that the harvested labels significantly improve an attentional baseline by 8.0 in average F-score (Type@AV). Surprisingly, we
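The idea of harvesting modality labels from a teacher can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `harvest_modality_labels`, the per-segment score dictionaries, and the 0.5 threshold are all hypothetical, and the scores are assumed to come from some frozen pre-trained teacher for a single modality. The sketch keeps a weak video-level event for a segment only when that modality's teacher scores it at or above the threshold.

```python
from typing import Dict, List, Set


def harvest_modality_labels(
    weak_labels: Set[str],
    teacher_scores: List[Dict[str, float]],
    threshold: float = 0.5,  # hypothetical cut-off, not taken from the paper
) -> List[Set[str]]:
    """Return, for one modality, the events kept in each video segment.

    weak_labels: events known to occur somewhere in the video (modality unknown).
    teacher_scores: one dict per segment mapping event name to a teacher score
        (assumed here to lie in [0, 1]).
    """
    harvested = []
    for segment_scores in teacher_scores:
        # Only events in the weak label set are candidates; the teacher
        # decides whether each one is perceivable in this modality.
        kept = {
            event
            for event in weak_labels
            if segment_scores.get(event, 0.0) >= threshold
        }
        harvested.append(kept)
    return harvested


# Toy example with made-up scores for a two-segment video.
weak = {"dog", "speech"}
visual_scores = [{"dog": 0.9, "speech": 0.1, "car": 0.8}, {"dog": 0.2, "speech": 0.3}]
audio_scores = [{"dog": 0.7, "speech": 0.6}, {"dog": 0.1, "speech": 0.9}]

visual_labels = harvest_modality_labels(weak, visual_scores)
audio_labels = harvest_modality_labels(weak, audio_scores)
```

In this toy input, "car" is never kept even though its visual score is high, because it is absent from the weak video-level labels; the teacher only refines which modality and segment each known event belongs to.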
Related papers
- Cross-modal Learning For Audio-visual Video Parsing (2021) 5.84
- Investigating Modality Bias In Audio Visual Video Parsing (2022) 0.00
- Label-anticipated Event Disentanglement For Audio-visual Video Parsing (2024) 8.60
- Dual Mean-teacher: An Unbiased Semi-supervised Framework For Audio-visual Source Localization (2024) 5.24
- Fine-grained Audio-visual Joint Representations For Multimodal Large Language Models (2023) 2.60
- Cross-modal Audio-visual Co-learning For Text-independent Speaker Verification (2023) 9.23
- VALOR: Vision-audio-language Omni-perception Pretraining Model And Dataset (2023) 10.61
- Audio-visual Event Localization On Portrait Mode Short Videos (2025) 0.00