SCORE: Self-supervised Correspondence Fine-tuning For Improved Content Representations
2024 · Amit Meghanani, Thomas Hain
Abstract
There is growing interest in cost-effective self-supervised fine-tuning (SSFT) of self-supervised learning (SSL)-based speech models to obtain task-specific representations. These task-specific representations are then fine-tuned on labelled data to give robust performance on various downstream tasks. This work presents a cost-effective SSFT method named Self-supervised Correspondence (SCORE) fine-tuning to adapt SSL speech representations for content-related tasks. The proposed method uses a correspondence training strategy that aims to learn similar representations from perturbed speech and original speech. Data augmentation techniques commonly used for content-related tasks (ASR) are applied to obtain the perturbed speech. SCORE fine-tuned HuBERT outperforms the vanilla HuBERT on the SUPERB benchmark with only a few hours of fine-tuning (< 5 hrs) on a single GPU for automatic speech recognition, phoneme recognition, and query-by-example tasks, with relative improvements of 1.09%, 3.
Related papers
- LASER: Learning By Aligning Self-supervised Representations Of Speech For Improving Content-related Tasks (2024)
- Fast-HuBERT: An Efficient Training Framework For Self-supervised Speech Representation Learning (2023)
- Fine-tuning Strategies For Faster Inference Using Speech Self-supervised Models: A Comparative Study (2023)
- Unsupervised Fine-tuning Data Selection For ASR Using Self-supervised Speech Models (2022)
- UniSpeech-SAT: Universal Speech Representation Learning With Speaker Aware Pre-training (2021)
- Automatic Pronunciation Assessment Using Self-supervised Speech Representation Learning (2022)
- STaR: Distilling Speech Temporal Relation For Lightweight Speech Self-supervised Learning Models (2023)
- Efficient Infusion Of Self-supervised Representations In Automatic Speech Recognition (2024)