AZeroS: Extending LLM To Speech With Self-generated Instruction-free Tuning
2025 · Yiwen Shao, Wei Liu, Jiahong Li, et al.
Abstract
Extending large language models (LLMs) to the speech domain has recently gained significant attention. A typical approach connects a pretrained LLM with an audio encoder through a projection module and trains the resulting model on large-scale, task-specific instruction-tuning datasets. However, curating such instruction-tuning data for specific requirements is time-consuming, and models trained in this manner often generalize poorly to unseen tasks. In this work, we first formulate that the strongest generalization of a speech-LLM is achieved when it is trained with Self-Generated Instruction-Free Tuning (SIFT), in which supervision signals are generated by a frozen LLM using textual representations of speech as input. Our proposed SIFT paradigm eliminates the need for collecting task-specific question-answer pairs and yields the theoretically best generalization to unseen tasks. Building upon this paradigm, we introduce AZeroS (Auden Zero-instruction-tuned Speech-LLM), which is train
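The SIFT idea described in the abstract (a frozen LLM produces the supervision signal from a textual representation of the speech, with no task-specific instruction) can be illustrated with a minimal sketch. Everything named here is hypothetical and not taken from the paper: `SpeechItem`, `SiftExample`, `build_sift_examples`, and the stand-in `toy_llm` are assumptions for illustration only, and the actual AZeroS pipeline may differ.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SpeechItem:
    audio_path: str  # hypothetical path to the waveform
    text_repr: str   # textual representation of the speech (e.g. a transcript)


@dataclass
class SiftExample:
    audio_path: str  # input to the speech-LLM: audio only, no instruction
    target: str      # response generated by the frozen LLM, used as supervision


def build_sift_examples(
    items: List[SpeechItem],
    frozen_llm: Callable[[str], str],
) -> List[SiftExample]:
    """Pair each audio clip with the frozen LLM's response to its text form.

    Assumption: the frozen LLM is exposed as a plain text-in, text-out callable.
    """
    examples = []
    for item in items:
        # The frozen LLM sees only the textual representation of the speech.
        target = frozen_llm(item.text_repr)
        # No task-specific question is attached; the audio alone is the input.
        examples.append(SiftExample(audio_path=item.audio_path, target=target))
    return examples


def toy_llm(prompt: str) -> str:
    """Stand-in for a frozen LLM so the sketch runs without a model."""
    return f"[response to] {prompt}"


if __name__ == "__main__":
    data = [SpeechItem("a.wav", "what is the weather today")]
    for ex in build_sift_examples(data, toy_llm):
        print(ex.audio_path, "->", ex.target)
```

In this sketch, no hand-written question-answer pairs are needed: the training targets come entirely from the frozen LLM, which is the property the abstract attributes to SIFT.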
Related papers
- Harnessing The Zero-shot Power Of Instruction-tuned Large Language Model In End-to-end Speech Recognition (2023)
- DeSTA2: Developing Instruction-following Speech Language Model Without Speech Instruction-tuning Data (2024)
- Teaching A Multilingual Large Language Model To Understand Multilingual Speech Via Multi-instructional Training (2024)
- Zero-resource Speech Translation And Recognition With LLMs (2024)
- Corpus Synthesis For Zero-shot ASR Domain Adaptation Using Large Language Models (2023)
- SIFT-50M: A Large-scale Multilingual Dataset For Speech Instruction Fine-tuning (2025)
- Exploring Fine-tuning Of Large Audio Language Models For Spoken Language Understanding Under Limited Speech Data (2025)
- Prompting Large Language Models For Zero-shot Domain Adaptation In Speech Recognition (2023)