Analyzing The Factors Affecting Usefulness Of Self-supervised Pre-trained Representations For Speech Recognition
2022 · Ashish Seth, Lodagala V S V Durga Prasad, Sreyan Ghosh, et al.
Abstract
Self-supervised learning (SSL) of high-level speech representations has been a popular approach to building Automatic Speech Recognition (ASR) systems in low-resource settings. However, the common assumption in the literature is that a considerable amount of unlabeled data is available for the same domain or language that can be leveraged for SSL pre-training, which we acknowledge is not feasible in a real-world setting. In this paper, as part of the Interspeech Gram Vaani ASR challenge, we study the effect of domain, language, dataset size, and other aspects of our upstream pre-training SSL data on the final performance of the low-resource downstream ASR task. We also build on the continued pre-training paradigm to study the effect of prior knowledge possessed by models trained using SSL. Extensive experiments and studies reveal that the performance of ASR systems is susceptible to the data used for SSL pre-training. Their performance improves with an increase in similarity and
Related papers
- Fine-tuning Strategies For Faster Inference Using Speech Self-supervised Models: A Comparative Study (2023) · 8.35
- Deploying Self-supervised Learning In The Wild For Hybrid Automatic Speech Recognition (2022) · 0.00
- Why Does Self-supervised Learning For Speech Recognition Benefit Speaker Recognition? (2022) · 10.74
- Investigating Self-supervised Learning For Speech Enhancement And Separation (2022) · 13.44
- Investigation Of Ensemble Features Of Self-supervised Pretrained Models For Automatic Speech Recognition (2022) · 9.41
- Comparing Self-supervised Learning Models Pre-trained On Human Speech And Animal Vocalizations For Bioacoustics Processing (2025) · 5.24
- Lebenchmark: A Reproducible Framework For Assessing Self-supervised Representation Learning From Speech (2021) · 11.39
- Weakly-supervised Speech Pre-training: A Case Study On Target Speech Recognition (2023) · 8.09