Non-contrastive Self-supervised Learning For Utterance-level Information Extraction From Speech
2022 · Jaejin Cho, Jesús Villalba, Laureano Moro-Velazquez, et al.
Abstract
In recent studies, self-supervised pre-trained models tend to outperform supervised pre-trained models in transfer learning. In particular, self-supervised learning (SSL) of utterance-level speech representations can be used in speech applications that require a discriminative representation of attributes that are consistent within an utterance: speaker, language, emotion, and age. Existing frame-level self-supervised speech representations, e.g., wav2vec, can be used as utterance-level representations with pooling, but the models are usually large. There are also SSL techniques that learn utterance-level representations. One of the most successful is a contrastive method, which requires negative sampling: selecting alternative samples to contrast with the current sample (the anchor). However, without labels, this does not ensure that all the negative samples belong to classes different from the anchor's class. This paper applies a non-contrastive self-supervised method to learn utterance-level embeddings. We