DeSTA2: Developing Instruction-following Speech Language Model Without Speech Instruction-tuning Data
2024 · Ke-Han Lu, Zhehuai Chen, Szu-Wei Fu, et al.
Abstract
Recent end-to-end speech language models (SLMs) have expanded upon the capabilities of large language models (LLMs) by incorporating pre-trained speech models. However, these SLMs often undergo extensive speech instruction-tuning to bridge the gap between speech and text modalities. This requires significant annotation efforts and risks catastrophic forgetting of the original language capabilities. In this work, we present a simple yet effective automatic process for creating speech-text pair data that carefully injects speech paralinguistic understanding abilities into SLMs while preserving the inherent language capabilities of the text-based LLM. Our model demonstrates general capabilities for speech-related tasks without the need for speech instruction-tuning data, achieving impressive performance on Dynamic-SUPERB and AIR-Bench-Chat benchmarks. Furthermore, our model exhibits the ability to follow complex instructions derived from LLMs, such as specific output formatting and chain-
Related papers
- DeSTA: Enhancing Speech Language Models Through Descriptive Speech-text Alignment (2024)
- Exploring Fine-tuning Of Large Audio Language Models For Spoken Language Understanding Under Limited Speech Data (2025)
- Dynamic-SUPERB: Towards A Dynamic, Collaborative, And Comprehensive Instruction-tuning Benchmark For Speech (2023)
- Harnessing The Zero-shot Power Of Instruction-tuned Large Language Model In End-to-end Speech Recognition (2023)
- SLM-S2ST: A Multimodal Language Model For Direct Speech-to-speech Translation (2025)
- Azeros: Extending LLM To Speech With Self-generated Instruction-free Tuning (2025)
- SLM-TTA: A Framework For Test-time Adaptation Of Generative Spoken Language Models (2025)
- Making LLMs Better Many-to-many Speech-to-text Translators With Curriculum Learning (2024)