BYOL For Audio: Self-supervised Learning For General-purpose Audio Representation
2021 · Daisuke Niizumi, Daiki Takeuchi, Yasunori Ohishi, et al.
Abstract
Inspired by the recent progress in self-supervised learning for computer vision that generates supervision using data augmentations, we explore a new general-purpose audio representation learning approach. We propose learning general-purpose audio representation from a single audio segment without expecting relationships between different time segments of audio samples. To implement this principle, we introduce Bootstrap Your Own Latent (BYOL) for Audio (BYOL-A, pronounced "viola"), an audio self-supervised learning method based on BYOL for learning general-purpose audio representation. Unlike most previous audio self-supervised learning methods that rely on agreement of vicinity audio segments or disagreement of remote ones, BYOL-A creates contrasts in an augmented audio segment pair derived from a single audio segment. With a combination of normalization and augmentation techniques, BYOL-A achieves state-of-the-art results in various downstream tasks. Extensive ablation studies also
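The core idea in the abstract, two augmented views derived from one audio segment and compared BYOL-style, can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the gain and circular-shift augmentations, the elementwise `encode` function, and all names are hypothetical stand-ins, and the normalized regression loss and moving-average target update are taken from the general BYOL recipe rather than from anything stated in this abstract. The online-side predictor that BYOL normally uses is omitted for brevity.

```python
import math
import random


def make_views(segment, rng):
    """Return two randomly perturbed copies of ONE segment.

    Random gain and a circular shift are placeholder augmentations
    (assumptions for illustration, not the paper's augmentations).
    """
    views = []
    for _ in range(2):
        gain = rng.uniform(0.8, 1.2)
        shift = rng.randrange(len(segment))
        rolled = segment[-shift:] + segment[:-shift]
        views.append([gain * x for x in rolled])
    return views[0], views[1]


def encode(view, weights):
    """Toy encoder: elementwise weighting (hypothetical stand-in for a network)."""
    return [w * x for w, x in zip(weights, view)]


def byol_loss(prediction, target):
    """Normalized MSE between two vectors, i.e. 2 - 2 * cosine similarity."""
    dot = sum(p * z for p, z in zip(prediction, target))
    norm_p = math.sqrt(sum(p * p for p in prediction))
    norm_z = math.sqrt(sum(z * z for z in target))
    return 2.0 - 2.0 * dot / (norm_p * norm_z)


def ema_update(target_w, online_w, tau=0.99):
    """Move target weights toward online weights by an exponential moving average."""
    return [tau * t + (1.0 - tau) * o for t, o in zip(target_w, online_w)]


rng = random.Random(0)
segment = [rng.gauss(0.0, 1.0) for _ in range(16)]  # toy stand-in for one audio segment
view_a, view_b = make_views(segment, rng)           # both views come from the same segment

online_w = [rng.gauss(0.0, 1.0) for _ in range(16)]
target_w = list(online_w)

# Online branch sees one view, target branch sees the other.
loss = byol_loss(encode(view_a, online_w), encode(view_b, target_w))

# In training, online_w would take a gradient step on `loss` here (omitted),
# after which the target weights slowly track the online weights.
target_w = ema_update(target_w, online_w)
```

Note that no second segment and no negative examples appear anywhere in the sketch: the only training signal is the disagreement between the two views of a single segment.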
Related papers
- AudioLDM 2: Learning Holistic Audio Generation With Self-supervised Pretraining (2023) · 0.00
- Learning Speech Representations From Raw Audio By Joint Audiovisual Self-supervision (2020) · 0.00
- Conformer-based Self-supervised Learning For Non-speech Audio Tasks (2021) · 7.50
- Automatic Data Augmentation Selection And Parametrization In Contrastive Self-supervised Speech Representation Learning (2022) · 5.24
- Audio ALBERT: A Lite BERT For Self-supervised Learning Of Audio Representation (2020) · 15.54
- Enhancing Unsupervised Audio Representation Learning Via Adversarial Sample Generation (2023) · 0.00
- Learning Self-supervised Audio-visual Representations For Sound Recommendations (2024) · 2.26
- Exploring Efficient-tuned Learning Audio Representation Method From BriVL (2023) · 0.00