MS-HuBERT: Mitigating Pre-training and Inference Mismatch in Masked Language Modelling Methods for Learning Speech Representations
2024 · Hemant Yadav, Sunayana Sitaram, Rajiv Ratn Shah
Abstract
In recent years, self-supervised pre-training methods have gained significant traction in learning high-level information from raw speech. Among these methods, HuBERT has demonstrated state-of-the-art (SOTA) performance in automatic speech recognition (ASR). However, HuBERT's performance lags behind data2vec due to disparities in pre-training strategies. In this paper, we propose (i) a Swap method to address the pre-training and inference mismatch observed in HuBERT and (ii) a Multicluster masked prediction loss for more effective utilization of the model's capacity. The resulting method, MS-HuBERT, is an end-to-end self-supervised pre-training method for learning robust speech representations. It beats vanilla HuBERT on the LibriSpeech ASR benchmark by a 5% margin on average when evaluated on different fine-tuning splits. Additionally, we demonstrate that the embeddings learned during pre-training encode essential information for improving the performance of content-based tasks such as ASR.
Related papers
- HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units (2021)
- Fast-HuBERT: An Efficient Training Framework for Self-Supervised Speech Representation Learning (2023)
- Selective HuBERT: Self-Supervised Pre-training for Target Speaker in Clean and Mixture Speech (2023)
- Unsupervised Accent Adaptation Through Masked Language Model Correction of Discrete Self-Supervised Speech Units (2023)
- Multi-resolution HuBERT: Multi-resolution Speech Self-Supervised Learning with Masked Unit Prediction (2023)
- An Adapter Based Multi-label Pre-training for Speech Separation and Enhancement (2022)
- MelHuBERT: A Simplified HuBERT on Mel Spectrograms (2022)
- Cocktail HuBERT: Generalized Self-Supervised Pre-training for Mixture and Single-Source Speech (2023)