Exploring Fine-tuning Of Large Audio Language Models For Spoken Language Understanding Under Limited Speech Data
2025 · Youngwon Choi, Jaeyoon Jung, Hyeonyu Kim, et al.
Abstract
Large Audio Language Models (LALMs) have emerged as powerful tools for speech-related tasks but remain underexplored for fine-tuning, especially with limited speech data. To bridge this gap, we systematically examine how different fine-tuning schemes (text-only, direct mixing, and curriculum learning) affect spoken language understanding (SLU), focusing on scenarios where text-label pairs are abundant while paired speech-label data are limited. Results show that LALMs already achieve competitive performance with text-only fine-tuning, highlighting their strong generalization ability. Adding even small amounts of speech data (2-5%) yields substantial further gains, with curriculum learning particularly effective under scarce data. In cross-lingual SLU, combining source-language speech data with target-language text and minimal target-language speech data enables effective adaptation. Overall, this study provides practical insights into LALM fine-tuning under realistic data c
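The three fine-tuning schemes named in the abstract can be pictured as different ways of ordering text and speech examples during training. The sketch below is only an illustration under stated assumptions: `build_schedule` is a hypothetical helper, "curriculum" is assumed here to mean a text stage followed by a speech stage, and the 5% speech ratio is taken relative to the text set. The abstract does not confirm any of these details, so the paper's actual procedure may differ.

```python
import random


def build_schedule(text_examples, speech_examples, scheme, seed=0):
    """Return a list of training stages; each stage is a shuffled list of examples.

    Hypothetical helper for illustration only, not the authors' implementation.
    """
    rng = random.Random(seed)
    if scheme == "text_only":
        # Fine-tune on text-label pairs only.
        stages = [list(text_examples)]
    elif scheme == "direct_mixing":
        # One stage containing text and speech examples together.
        stages = [list(text_examples) + list(speech_examples)]
    elif scheme == "curriculum":
        # Assumption: text first, then the small speech set.
        stages = [list(text_examples), list(speech_examples)]
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    for stage in stages:
        rng.shuffle(stage)
    return stages


# Toy data: 100 text-label pairs and 5 speech-label pairs (5% of the text set).
text = [("text", i) for i in range(100)]
speech = [("speech", i) for i in range(5)]

for scheme in ("text_only", "direct_mixing", "curriculum"):
    sizes = [len(stage) for stage in build_schedule(text, speech, scheme)]
    print(scheme, sizes)
```

Each inner list would feed one fine-tuning pass, so the schemes differ only in how the scarce speech examples are scheduled relative to the abundant text.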
Related papers
- Measuring Audio's Impact On Correctness: Audio-contribution-aware Post-training Of Large Audio Language Models (2025)
- DeSTA2: Developing Instruction-following Speech Language Model Without Speech Instruction-tuning Data (2024)
- Making LLMs Better Many-to-many Speech-to-text Translators With Curriculum Learning (2024)
- Boosting Large Language Model For Speech Synthesis: An Empirical Study (2023)
- Style Attuned Pre-training And Parameter Efficient Fine-tuning For Spoken Language Understanding (2020)
- LESS: Large Language Model Enhanced Semi-supervised Learning For Speech Foundational Models Using In-the-wild Data (2025)
- Speech Recognition With LLMs Adapted To Disordered Speech Using Reinforcement Learning (2024)
- A Study On The Integration Of Pre-trained SSL, ASR, LM And SLU Models For Spoken Language Understanding (2022)