SLICER: Learning Universal Audio Representations Using Low-resource Self-supervised Pre-training
2022 · Ashish Seth, Sreyan Ghosh, S. Umesh, et al.
Abstract
We present a new Self-Supervised Learning (SSL) approach to pre-train encoders on unlabeled audio data that reduces the need for large amounts of labeled data for audio and speech classification. Our primary aim is to learn audio representations that generalize across a wide variety of speech and non-speech tasks in a low-resource, unlabeled audio pre-training setting. Inspired by the recent success of clustering and contrasting learning paradigms for SSL-based speech representation learning, we propose SLICER (Symmetrical Learning of Instance and Cluster-level Efficient Representations), which brings together the best of both clustering and contrasting learning paradigms. We use a symmetric loss between latent representations from student and teacher encoders and simultaneously solve instance and cluster-level contrastive learning tasks. We obtain cluster representations online by just projecting the input spectrogram into an output subspace with dimensions equal to the number of
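The combination of instance-level and cluster-level contrastive tasks described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation. It assumes that the student and teacher encoders have already produced feature matrices of shape (batch, dim), that a single linear map `w_cluster` (a hypothetical name) projects features to K soft cluster assignments, and that an InfoNCE-style cross-entropy is used for both tasks. The temperature value and the equal weighting of the two terms are also assumptions.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def log_softmax(x):
    """Numerically stable row-wise log-softmax."""
    x = x - x.max(axis=1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=1, keepdims=True))


def contrastive_loss(a, b, temperature=0.5):
    """InfoNCE-style loss: row i of `a` is the positive for row i of `b`;
    every other row of `b` acts as a negative."""
    logits = l2_normalize(a) @ l2_normalize(b).T / temperature
    return -np.mean(np.diag(log_softmax(logits)))


def symmetric_contrastive_loss(a, b, temperature=0.5):
    """Average the loss over both directions (a -> b and b -> a)."""
    return 0.5 * (contrastive_loss(a, b, temperature)
                  + contrastive_loss(b, a, temperature))


def instance_and_cluster_loss(student, teacher, w_cluster, temperature=0.5):
    """Sum of an instance-level and a cluster-level symmetric contrastive loss.

    student, teacher: (batch, dim) features from the two encoders (assumed given).
    w_cluster: (dim, K) hypothetical projection to K cluster logits.
    """
    # Instance level: rows are samples, so matching samples are positives.
    instance = symmetric_contrastive_loss(student, teacher, temperature)

    # Cluster level: soft assignments are (batch, K); transposing makes each
    # row a cluster, described by its assignment pattern over the batch.
    p_student = np.exp(log_softmax(student @ w_cluster))
    p_teacher = np.exp(log_softmax(teacher @ w_cluster))
    cluster = symmetric_contrastive_loss(p_student.T, p_teacher.T, temperature)

    return instance + cluster


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    student = rng.normal(size=(8, 16))   # stand-in for student encoder output
    teacher = rng.normal(size=(8, 16))   # stand-in for teacher encoder output
    w_cluster = rng.normal(size=(16, 4))  # K = 4 clusters, chosen arbitrarily
    print(instance_and_cluster_loss(student, teacher, w_cluster))
```

Transposing the assignment matrix is the design choice that lets the same loss function serve both tasks: rows are samples in the instance-level call and clusters in the cluster-level call.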
Related papers
- UniSpeech-SAT: Universal Speech Representation Learning With Speaker Aware Pre-training (2021) 0.00
- An Adapter Based Pre-training For Efficient And Scalable Self-supervised Speech Representation Learning (2021) 8.35
- Non-contrastive Self-supervised Learning For Utterance-level Information Extraction From Speech (2022) 9.59
- Improving Self-supervised Learning For Audio Representations By Feature Diversity And Decorrelation (2023) 0.00
- A Pre-training Framework That Encodes Noise Information For Speech Quality Assessment (2024) 3.58
- Low-resource Self-supervised Learning With SSL-enhanced TTS (2023) 0.00
- Efficient Infusion Of Self-supervised Representations In Automatic Speech Recognition (2024) 0.00
- Weakly-supervised Speech Pre-training: A Case Study On Target Speech Recognition (2023) 8.09