Combining Spectral And Self-supervised Features For Low Resource Speech Recognition And Translation
2022 · Dan Berrebbi, Jiatong Shi, Brian Yan, et al.
Abstract
Self-Supervised Learning (SSL) models have been successfully applied to various deep learning-based speech tasks, particularly those with a limited amount of data. However, the quality of SSL representations depends heavily on the relatedness between the SSL training domain(s) and the target data domain. In contrast, spectral feature (SF) extractors such as log Mel-filterbanks are hand-crafted, non-learnable components and could be more robust to domain shifts. The present work examines the assumption that combining non-learnable SF extractors with SSL models is an effective approach to low-resource speech tasks. We propose a learnable and interpretable framework to combine SF and SSL representations. The proposed framework significantly outperforms both baseline and SSL models on Automatic Speech Recognition (ASR) and Speech Translation (ST) tasks on three low-resource datasets. We additionally design a mixture-of-experts-based combination model. This last model reveals that the rela
Related papers
- Exploring Effective Fusion Algorithms For Speech Based Self-supervised Learning Models (2022) · 0.00
- Low-resource Self-supervised Learning With SSL-enhanced TTS (2023) · 0.00
- Fusion Of Discrete Representations And Self-augmented Representations For Multilingual Automatic Speech Recognition (2024) · 2.26
- EFFUSE: Efficient Self-supervised Feature Fusion For E2E ASR In Low Resource And Multilingual Scenarios (2023) · 6.34
- Deploying Self-supervised Learning In The Wild For Hybrid Automatic Speech Recognition (2022) · 0.00
- How To Learn A New Language? An Efficient Solution For Self-supervised Learning Models Unseen Languages Adaption In Low-resource Scenario (2024) · 0.00
- Fine-tuning Strategies For Faster Inference Using Speech Self-supervised Models: A Comparative Study (2023) · 8.35
- Analyzing The Factors Affecting Usefulness Of Self-supervised Pre-trained Representations For Speech Recognition (2022) · 0.00