Desta: Enhancing Speech Language Models Through Descriptive Speech-text Alignment
2024 Β· Ke-Han Lu, Zhehuai Chen, Szu-Wei Fu, et al.
Abstract
Recent speech language models (SLMs) typically incorporate pre-trained speech models to extend the capabilities from large language models (LLMs). In this paper, we propose a Descriptive Speech-Text Alignment approach that leverages speech captioning to bridge the gap between speech and text modalities, enabling SLMs to interpret and generate comprehensive natural language descriptions, thereby facilitating the capability to understand both linguistic and non-linguistic features in speech. Enhanced with the proposed approach, our model demonstrates superior performance on the Dynamic-SUPERB benchmark, particularly in generalizing to unseen tasks. Moreover, we discover that the aligned model exhibits a zero-shot instruction-following capability without explicit speech instruction tuning. These findings highlight the potential to reshape instruction-following SLMs by incorporating rich, descriptive speech captions.
Authors
(none)
Tags
Stats
Related papers
- Desta2: Developing Instruction-following Speech Language Model Without Speech Instruction-tuning Data (2024)8.82
- Align-slm: Textless Spoken Language Models With Reinforcement Learning From AI Feedback (2024)7.16
- SLAM-AAC: Enhancing Audio Captioning With Paraphrasing Augmentation And Clap-refine Through Llms (2024)0.00
- Enhancing Automated Audio Captioning Via Large Language Models With Optimized Audio Encoding (2024)5.24
- Improving Robustness Of Llm-based Speech Synthesis By Learning Monotonic Alignment (2024)0.00
- BLSP: Bootstrapping Language-speech Pre-training Via Behavior Alignment Of Continuation Writing (2023)0.00
- Exploring Fine-tuning Of Large Audio Language Models For Spoken Language Understanding Under Limited Speech Data (2025)0.00
- TASTE: Text-aligned Speech Tokenization And Embedding For Spoken Language Modeling (2025)0.00