Audio Representation Learning By Distilling Video As Privileged Information
2023 · Amirhossein Hajavi, Ali Etemad
Abstract
Deep audio representation learning using multi-modal audio-visual data often yields better performance than uni-modal approaches. However, in real-world scenarios the two modalities are not always both available at inference time, which degrades the performance of models trained for multi-modal inference. In this work, we propose a novel approach for deep audio representation learning using audio-visual data when the video modality is absent at inference. For this purpose, we adopt teacher-student knowledge distillation under the framework of learning using privileged information (LUPI). Whereas previous LUPI methods use soft labels generated by the teacher, our method uses the embeddings learned by the teacher to train the student network. We integrate our method in two settings: sequential data, where the features are divided into multiple segments over time, and non-sequential data, where the entire features are treated as one who
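The idea of training the student on teacher embeddings rather than soft labels can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' formulation: the abstract does not state the distance function, the weighting, or the network details, so mean squared error, the `alpha` weight, and the helper names `mse`, `cross_entropy`, and `student_loss` are hypothetical. The teacher embedding stands in for the output of an audio-visual network, and the student embedding for the output of an audio-only network.

```python
import math

def mse(a, b):
    """Mean squared error between two equal-length embedding vectors.
    MSE is an assumed distance; the abstract does not name one."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def cross_entropy(logits, label):
    """Softmax cross-entropy for a single example, computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[label]

def student_loss(student_emb, teacher_emb, logits, label, alpha=0.5):
    """Weighted sum of the task loss and the embedding-matching term.
    alpha is a hypothetical trade-off weight, not a value from the paper."""
    task = cross_entropy(logits, label)
    distill = mse(student_emb, teacher_emb)
    return (1 - alpha) * task + alpha * distill

# Toy values: the teacher saw audio and video, the student saw audio only.
teacher_emb = [1.0, 0.0, -1.0]
student_emb = [0.5, 0.0, -0.5]
student_logits = [0.0, 0.0, 0.0]
loss = student_loss(student_emb, teacher_emb, student_logits, label=1)
```

In this sketch only the student is evaluated at inference, so the video input is needed solely to produce `teacher_emb` during training.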
Related papers
- Audio-visual Representation Learning Via Knowledge Distillation From Speech Foundation Models (2025) 7.81
- Self-supervised Audio-visual Speech Representations Learning By Multimodal Self-distillation (2022) 0.00
- Metric Learning With Progressive Self-distillation For Audio-visual Embedding Learning (2025) 3.58
- BYOL For Audio: Self-supervised Learning For General-purpose Audio Representation (2021) 15.22
- Modality-independent Teachers Meet Weakly-supervised Audio-visual Event Parser (2023) 4.77
- Cross-modal Distillation For Widely Differing Modalities (2025) 0.00
- Unified Video-language Pre-training With Synchronized Audio (2024) 0.00
- Learning Music Audio Representations Via Weak Language Supervision (2021) 10.07