Positive And Negative Sampling Strategies For Self-supervised Learning On Audio-video Data
2024 · Shanshan Wang, Soumya Tripathy, Toni Heittola, et al.
Abstract
In Self-Supervised Learning (SSL), Audio-Visual Correspondence (AVC) is a popular task to learn deep audio and video features from large unlabeled datasets. The key step in AVC is to randomly sample audio and video clips from the dataset and learn to minimize the feature distance between the positive pairs (corresponding audio-video pairs) while maximizing the distance between the negative pairs (non-corresponding audio-video pairs). The learnt features are shown to be effective on various downstream tasks. However, these methods achieve subpar performance when the dataset is small. In this paper, we investigate the effect of utilizing class label information in the AVC feature learning task. We modify various positive and negative data sampling techniques of SSL based on class label information to investigate the effect on feature quality. We propose a new sampling approach, which we call soft-positive sampling, where the positive pair for one audio sample is no
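To make the AVC sampling step described in the abstract concrete, here is a minimal sketch in Python with NumPy. It is not the authors' implementation: the function names `sample_avc_pairs` and `contrastive_loss`, the margin-based loss on Euclidean distance, and the `margin` value are all assumptions chosen for illustration, since the abstract does not state which loss is used. The sketch covers only the plain label-free sampling (a positive pair is audio and video from the same clip, a negative pair is audio and video from different clips) and does not attempt the label-based or soft-positive variants.

```python
import numpy as np


def sample_avc_pairs(num_clips, rng):
    """Hypothetical helper: for each clip i, return indices for one
    positive pair (audio i, video i) and one negative pair
    (audio i, video j) with j != i. Requires num_clips >= 2."""
    audio_idx = np.arange(num_clips)
    # A random offset in [1, num_clips - 1] guarantees j != i.
    offsets = rng.integers(1, num_clips, size=num_clips)
    neg_video_idx = (audio_idx + offsets) % num_clips
    return audio_idx, audio_idx.copy(), neg_video_idx


def contrastive_loss(audio_feat, video_feat, is_positive, margin=1.0):
    """Assumed margin-based contrastive loss: pull positive pairs
    together, push negative pairs at least `margin` apart."""
    dist = np.linalg.norm(audio_feat - video_feat, axis=1)
    pos_term = is_positive * dist ** 2
    neg_term = (1 - is_positive) * np.maximum(0.0, margin - dist) ** 2
    return float(np.mean(pos_term + neg_term))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-in features; in practice these come from audio/video encoders.
    audio = rng.normal(size=(8, 16))
    video = rng.normal(size=(8, 16))
    a_idx, pos_idx, neg_idx = sample_avc_pairs(8, rng)
    pos_loss = contrastive_loss(audio[a_idx], video[pos_idx], np.ones(8))
    neg_loss = contrastive_loss(audio[a_idx], video[neg_idx], np.zeros(8))
    print(pos_loss, neg_loss)
```

In this label-free form, two clips of the same class can still be drawn as a negative pair. That is where class label information could change the sampling.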
Related papers
- Self-supervised Learning Of Audio Representations Using Angular Contrastive Loss (2022) 5.84
- On Negative Sampling For Contrastive Audio-text Retrieval (2022) 0.00
- Dual Mean-teacher: An Unbiased Semi-supervised Framework For Audio-visual Source Localization (2024) 5.24
- Improving Self-supervised Learning For Audio Representations By Feature Diversity And Decorrelation (2023) 0.00
- Enhancing Unsupervised Audio Representation Learning Via Adversarial Sample Generation (2023) 0.00
- C3-DINO: Joint Contrastive And Non-contrastive Self-supervised Learning For Speaker Verification (2022) 10.21
- Deploying Self-supervised Learning In The Wild For Hybrid Automatic Speech Recognition (2022) 0.00
- SLICER: Learning Universal Audio Representations Using Low-resource Self-supervised Pre-training (2022) 0.00