EFFUSE: Efficient Self-supervised Feature Fusion For E2E ASR In Low Resource And Multilingual Scenarios
2023 · Tejes Srivastava, Jiatong Shi, William Chen, et al.
Abstract
Self-Supervised Learning (SSL) models have demonstrated exceptional performance in various speech tasks, particularly in low-resource and multilingual domains. Recent works show that fusing diverse SSL models can achieve better performance than using a single SSL model. However, fusing models increases the overall parameter count, leading to higher computational costs. We propose EFFUSE, a novel approach that uses a single SSL model to mimic the features of multiple SSL models via prediction, resulting in a lightweight framework with competitive performance. Our experiments show that EFFUSE outperforms individual SSL models in multilingual speech recognition tasks. Our best-performing model achieves an average SUPERB score increase of 63.5 (6.3%) over the SSL baselines on the Multilingual Speech Universal PERformance Benchmark (ML-SUPERB), while decreasing parameter size on average by 317M parameters (49%) relative to the fusion models.
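The core idea in the abstract, predicting a second SSL model's features from a single model's features and then fusing them, can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not taken from the paper: the features are synthetic random arrays, the predictor is a plain linear map fitted by least squares, fusion is simple concatenation, and the names `fit_predictor` and `fuse` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical frame-level features: T frames from a "primary" SSL model
# (dim 8) and a "secondary" SSL model (dim 6). Both are synthetic.
T, D_PRIMARY, D_SECONDARY = 200, 8, 6
primary = rng.normal(size=(T, D_PRIMARY))

# Assumption: the secondary features are a noisy linear function of the
# primary ones, so that a linear predictor can recover them.
true_map = rng.normal(size=(D_PRIMARY, D_SECONDARY))
secondary = primary @ true_map + 0.01 * rng.normal(size=(T, D_SECONDARY))


def fit_predictor(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Fit a linear map from x to y by ordinary least squares."""
    weights, *_ = np.linalg.lstsq(x, y, rcond=None)
    return weights


def fuse(x: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Concatenate real primary features with predicted secondary features."""
    predicted = x @ weights
    return np.concatenate([x, predicted], axis=-1)


# "Training": the secondary model is only needed here, to supply targets.
weights = fit_predictor(primary, secondary)

# "Inference": only the primary features and the small predictor are used.
fused = fuse(primary, weights)
prediction_error = float(np.mean((primary @ weights - secondary) ** 2))

print(fused.shape)
print(prediction_error)
```

In this sketch the secondary model is only consulted while fitting the predictor, which is the source of the parameter savings the abstract describes: at inference time the fused representation is built from one model plus a small prediction head.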