Speech-FT: Merging Pre-trained And Fine-tuned Speech Representation Models For Cross-task Generalization
2026 · Tzu-Quan Lin, Wei-Ping Huang, Hao Tang, et al.
Abstract
arXiv:2502.12672v4
Fine-tuning speech representation models can enhance performance on specific tasks but often compromises their cross-task generalization ability. This degradation is often caused by excessive changes in the representations, which make it difficult to retain information learned during pre-training. Existing approaches, such as regularizing weight changes during fine-tuning, may fail to maintain sufficiently high feature similarity with the pre-trained model and can therefore lose cross-task generalization. To address this issue, we propose Speech-FT, a novel two-stage fine-tuning framework designed to maintain cross-task generalization while benefiting from fine-tuning. Speech-FT first applies fine-tuning specifically designed to reduce representational drift, followed by weight-space interpolation with the pre-trained model to restore cross-task generalization. Extensive experiments on HuBERT, wav2vec 2.0, DeCoAR 2.0, and Wa
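The second stage named in the abstract, weight-space interpolation with the pre-trained model, can be illustrated with a minimal sketch. The abstract does not give the exact formula or coefficient, so the linear form below, the function name `interpolate_weights`, the parameter `alpha`, and the plain-dictionary weight layout are all assumptions made for illustration, not the authors' implementation.

```python
from typing import Dict, List

# Hypothetical weight layout: parameter name -> flat list of floats.
Weights = Dict[str, List[float]]


def interpolate_weights(pretrained: Weights, finetuned: Weights, alpha: float) -> Weights:
    """Linearly interpolate two sets of weights, parameter by parameter.

    Assumed form: merged = (1 - alpha) * pretrained + alpha * finetuned.
    alpha = 0 returns the pre-trained weights, alpha = 1 the fine-tuned ones.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if pretrained.keys() != finetuned.keys():
        raise ValueError("both models must have the same parameter names")

    merged: Weights = {}
    for name, pre in pretrained.items():
        ft = finetuned[name]
        if len(pre) != len(ft):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # Element-wise convex combination of the two parameter vectors.
        merged[name] = [(1.0 - alpha) * p + alpha * f for p, f in zip(pre, ft)]
    return merged


# Toy usage with made-up numbers (not results from the paper).
pre = {"layer.weight": [0.0, 2.0], "layer.bias": [4.0]}
ft = {"layer.weight": [2.0, 4.0], "layer.bias": [0.0]}
merged = interpolate_weights(pre, ft, alpha=0.5)
```

Under this assumed form, a smaller `alpha` keeps the merged model closer to the pre-trained weights, while a larger `alpha` keeps more of the fine-tuned behavior.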
Related papers
- MS-HuBERT: Mitigating Pre-training And Inference Mismatch In Masked Language Modelling Methods For Learning Speech Representations (2024) 4.52
- Less Forgetting For Better Generalization: Exploring Continual-learning Fine-tuning Methods For Speech Self-supervised Representations (2024) 0.00
- SCORE: Self-supervised Correspondence Fine-tuning For Improved Content Representations (2024) 0.00
- Self-supervised Rewiring Of Pre-trained Speech Encoders: Towards Faster Fine-tuning With Less Labels In Speech Processing (2022) 3.58
- Application Of Knowledge Distillation To Multi-task Speech Representation Learning (2022) 2.26
- Masked Modeling Duo For Speech: Specializing General-purpose Audio Representation To Speech Using Denoising Distillation (2023) 7.94
- Distance-based Weight Transfer From Near-field To Far-field Speaker Verification (2023) 0.00
- Efficient Emotion And Speaker Adaptation In LLM-based TTS Via Characteristic-specific Partial Fine-tuning (2025) 0.00