Speech-FT: Merging Pre-trained and Fine-tuned Speech Representation Models for Cross-Task Generalization

Abstract

arXiv:2502.12672v4

Fine-tuning speech representation models can enhance performance on specific tasks but often compromises their cross-task generalization ability. This degradation typically stems from excessive changes in the representations, which make it difficult to retain information learned during pre-training. Existing approaches, such as regularizing weight changes during fine-tuning, may fail to maintain sufficiently high feature similarity with the pre-trained model and may therefore lose cross-task generalization. To address this issue, we propose Speech-FT, a novel two-stage fine-tuning framework designed to maintain cross-task generalization while benefiting from fine-tuning. Speech-FT first applies fine-tuning specifically designed to reduce representational drift, followed by weight-space interpolation with the pre-trained model to restore cross-task generalization. Extensive experiments on HuBERT, wav2vec 2.0, DeCoAR 2.0, and Wa
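The weight-space interpolation step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formula, so this assumes plain linear interpolation with a single coefficient; the function name `interpolate_weights`, the parameter `alpha`, and the use of flat lists of floats in place of real model tensors are all hypothetical choices made for illustration, not the authors' implementation.

```python
def interpolate_weights(pretrained, finetuned, alpha):
    """Linearly interpolate two sets of weights, parameter by parameter.

    Assumption (not stated in the abstract): the merged weights are
    (1 - alpha) * pretrained + alpha * finetuned, with alpha in [0, 1].
    Each argument maps a parameter name to a flat list of floats, a
    stand-in for the tensors of a real model checkpoint.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if pretrained.keys() != finetuned.keys():
        raise ValueError("both models must have the same parameter names")

    merged = {}
    for name, pre_values in pretrained.items():
        ft_values = finetuned[name]
        if len(pre_values) != len(ft_values):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # alpha = 0 keeps the pre-trained weights; alpha = 1 keeps the
        # fine-tuned weights; values in between blend the two.
        merged[name] = [
            (1.0 - alpha) * p + alpha * f
            for p, f in zip(pre_values, ft_values)
        ]
    return merged


# Toy usage with two one-parameter "models".
pretrained = {"w": [0.0, 2.0]}
finetuned = {"w": [1.0, 4.0]}
merged = interpolate_weights(pretrained, finetuned, alpha=0.25)
```

Under this assumption, a smaller `alpha` keeps the merged model closer to the pre-trained weights, while a larger `alpha` retains more of the fine-tuned weights.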
