Optimizing Speech Multi-view Feature Fusion Through Conditional Computation
2025 · Weiqiao Shan, Yuhao Zhang, Yuchen Han, et al.
Abstract
Recent advancements have highlighted the efficacy of self-supervised learning (SSL) features in various speech-related tasks, providing lightweight and versatile multi-view speech representations. However, our study reveals that while SSL features expedite model convergence, they conflict with traditional spectral features like FBanks in terms of update directions. In response, we propose a novel generalized feature fusion framework grounded in conditional computation, featuring a gradient-sensitive gating network and a multi-stage dropout strategy. This framework mitigates feature conflicts and bolsters model robustness to multi-view input features. By integrating SSL and spectral features, our approach accelerates convergence and maintains performance on par with spectral models across multiple speech translation tasks on the MuST-C dataset.
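As a rough illustration of the ideas named in the abstract, the sketch below shows (a) a cosine-similarity check for conflicting update directions, (b) a plain sigmoid gate that blends an SSL frame with an FBank frame, and (c) a simple view-level dropout. This is not the authors' implementation: the abstract does not specify the gradient-sensitive gating network or the multi-stage dropout schedule, so an ordinary scalar gate and a single-stage dropout stand in for them. All function names, the equal feature dimension of the two views, and the parameter `p_drop` are assumptions made for this example.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity of two gradient vectors; a negative value
    indicates that the two updates point in conflicting directions."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(ssl_frame, fbank_frame, gate_weights, gate_bias=0.0):
    """Blend two feature vectors of equal size with a scalar gate g in (0, 1).

    Assumption: both views were already projected to the same dimension,
    and gate_weights has one weight per element of the concatenated input.
    """
    concat = list(ssl_frame) + list(fbank_frame)
    g = sigmoid(sum(w * x for w, x in zip(gate_weights, concat)) + gate_bias)
    fused = [g * s + (1.0 - g) * f for s, f in zip(ssl_frame, fbank_frame)]
    return fused, g


def view_dropout(ssl_frame, fbank_frame, p_drop, rng):
    """With probability p_drop, zero out one randomly chosen view.

    Intended for training only, so the model cannot rely on a single view.
    """
    if rng.random() < p_drop:
        if rng.random() < 0.5:
            return [0.0] * len(ssl_frame), list(fbank_frame)
        return list(ssl_frame), [0.0] * len(fbank_frame)
    return list(ssl_frame), list(fbank_frame)


if __name__ == "__main__":
    rng = random.Random(0)
    ssl, fbank = view_dropout([1.0, 2.0], [3.0, 4.0], p_drop=0.3, rng=rng)
    fused, gate = gated_fusion(ssl, fbank, gate_weights=[0.1, -0.2, 0.05, 0.0])
    print("gate:", gate, "fused:", fused)
    print("gradient agreement:", cosine([1.0, 0.5], [-0.8, 0.1]))
```

In a real model the gate weights would be learned, and a gradient-sensitive variant would presumably condition the gate on a signal such as the cosine similarity above; how the paper does this is not stated in the abstract.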
Related papers
- Exploring Effective Fusion Algorithms For Speech Based Self-supervised Learning Models (2022) 0.00
- BSS-CFFMA: Cross-domain Feature Fusion And Multi-attention Speech Enhancement Network Based On Self-supervised Embedding (2024) 4.52
- Fusion Of Discrete Representations And Self-augmented Representations For Multilingual Automatic Speech Recognition (2024) 2.26
- Combining Spectral And Self-supervised Features For Low Resource Speech Recognition And Translation (2022) 8.82
- EFFUSE: Efficient Self-supervised Feature Fusion For E2E ASR In Low Resource And Multilingual Scenarios (2023) 6.34
- Fine-tuning Strategies For Faster Inference Using Speech Self-supervised Models: A Comparative Study (2023) 8.35
- Simultaneous Or Sequential Training? How Speech Representations Cooperate In A Multi-task Self-supervised Learning System (2023) 3.58
- Exploiting Consistency-preserving Loss And Perceptual Contrast Stretching To Boost SSL-based Speech Enhancement (2024) 6.77