Efficient Emotion And Speaker Adaptation In LLM-based TTS Via Characteristic-specific Partial Fine-tuning
2025 · Tianrui Wang, Meng Ge, Cheng Gong, et al.
Abstract
While LLM-based TTS models exhibit zero-shot emotion and speaker cloning, their cloning fidelity and pronunciation clarity degrade on unseen domains. Fine-tuning is essential for adaptation, yet uniform approaches overlook specific parameter contributions. Uniform tuning on limited data causes slow training and catastrophic forgetting, leading to degraded pronunciation accuracy. To address this, we propose CSP-FT, a characteristic-specific partial fine-tuning strategy. By dynamically analyzing layer contributions via a weighted sum, we selectively fine-tune only the two layers capturing the most and least emotion and speaker information, maximizing the utility of the former while explicitly strengthening the capacity of the latter. Experiments on a combined corpus of 11 datasets show CSP-FT matches or exceeds the fidelity and intelligibility of full fine-tuning while updating only ~8% of parameters, accelerating training by ~2x, and significantly mitigating catastrophic forgetting.
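The selection step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact procedure, so this is an assumption: each layer has a learned scalar score, the scores are softmax-normalized into the weights of the weighted sum, and the layers with the largest and smallest weight are the two chosen for fine-tuning. The function names and the score values are hypothetical.

```python
import math

def softmax(scores):
    """Normalize raw per-layer scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def select_layers(raw_scores):
    """Return (most, least): indices of the highest- and lowest-weighted layers."""
    weights = softmax(raw_scores)
    indices = range(len(weights))
    most = max(indices, key=weights.__getitem__)
    least = min(indices, key=weights.__getitem__)
    return most, least

def trainable_mask(num_layers, selected):
    """True for layers to fine-tune, False for layers kept frozen."""
    return [i in selected for i in range(num_layers)]

# Hypothetical learned scores for a 6-layer model (illustrative values only,
# not taken from the paper).
raw_scores = [0.2, 1.5, 0.9, -0.7, 0.4, 1.1]
most, least = select_layers(raw_scores)
mask = trainable_mask(len(raw_scores), {most, least})
```

In a real training setup the mask would decide which layers' parameters receive gradient updates; every other layer stays frozen, which is what keeps the fraction of updated parameters small.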
Related papers
- EMORL-TTS: Reinforcement Learning For Fine-grained Emotion Control In LLM-based TTS (2025)
- Adapter-based Extension Of Multi-speaker Text-to-speech Model For New Speakers (2022)
- Exploring Fine-tuning Of Large Audio Language Models For Spoken Language Understanding Under Limited Speech Data (2025)
- Exploring Transfer Learning For Low Resource Emotional TTS (2019)
- Bayesian Parameter-efficient Fine-tuning For Overcoming Catastrophic Forgetting (2024)
- Stylespeech: Parameter-efficient Fine Tuning For Pre-trained Controllable Text-to-speech (2024)
- Continual Speaker Adaptation For Text-to-speech Synthesis (2021)
- Effective Text Adaptation For LLM-based ASR Through Soft Prompt Fine-tuning (2024)