Towards Contrastive Learning In Music Video Domain
2023 · Karel Veldkamp, Mariya Hendriksen, Zoltán Szlávik, et al.
Abstract
Contrastive learning is a powerful way of learning multimodal representations across various domains such as image-caption retrieval and audio-visual representation learning. In this work, we investigate if these findings generalize to the domain of music videos. Specifically, we create a dual encoder for the audio and video modalities and train it using a bidirectional contrastive loss. For the experiments, we use an industry dataset containing 550 000 music videos as well as the public Million Song Dataset, and evaluate the quality of learned representations on the downstream tasks of music tagging and genre classification. Our results indicate that pre-trained networks without contrastive fine-tuning outperform our contrastive learning approach when evaluated on both tasks. To gain a better understanding of the reasons contrastive learning was not successful for music videos, we perform a qualitative analysis of the learned representations, revealing why contrastive learning might
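For readers unfamiliar with the term, a bidirectional contrastive loss over paired audio and video embeddings can be sketched as a symmetric InfoNCE objective. The sketch below is an illustration only and is not taken from the paper: the function names, the temperature value of 0.07, and the use of cosine similarity are all assumptions, and the encoders themselves are omitted (the inputs stand in for their outputs).

```python
import numpy as np


def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def bidirectional_contrastive_loss(audio_emb, video_emb, temperature=0.07):
    """Symmetric InfoNCE loss for a batch of paired embeddings.

    audio_emb, video_emb: arrays of shape (batch, dim), where row i of each
    array comes from the same music video. The temperature is an assumed
    hyperparameter, not a value reported in the abstract.
    """
    # L2-normalise so the dot product is a cosine similarity.
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)

    # logits[i, j] compares audio clip i with video clip j.
    logits = a @ v.T / temperature
    idx = np.arange(logits.shape[0])

    # Audio-to-video: each row should put its mass on the diagonal entry.
    loss_a2v = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Video-to-audio: the same requirement, applied to each column.
    loss_v2a = -log_softmax(logits, axis=0)[idx, idx].mean()

    return 0.5 * (loss_a2v + loss_v2a)


# Toy usage with random vectors standing in for encoder outputs.
rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 16))
aligned_loss = bidirectional_contrastive_loss(emb, emb)
mismatched_loss = bidirectional_contrastive_loss(emb, emb[::-1])
```

In this toy setup the loss is lower when row i of both inputs belongs to the same item than when the pairing is scrambled, which is the property that training would exploit.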
Related papers
- Sequential Contrastive Audio-visual Learning (2024)
- Learning Video Representations Using Contrastive Bidirectional Transformer (2019)
- Augment, Drop & Swap: Improving Diversity In LLM Captions For Efficient Music-text Representation Learning (2024)
- Unsupervised Voice-face Representation Learning By Cross-modal Prototype Contrast (2022)
- Cross-modal Contrastive Representation Learning For Audio-to-image Generation (2022)
- Large-scale Contrastive Language-audio Pretraining With Feature Fusion And Keyword-to-caption Augmentation (2022)
- Enhancing GAN-based Vocoders With Contrastive Learning Under Data-limited Condition (2023)
- CoLLAP: Contrastive Long-form Language-audio Pretraining With Musical Temporal Structure Augmentation (2024)