Towards Efficient Speech-text Jointly Decoding Within One Speech Language Model
2025 · Haibin Wu, Yuxuan Hu, Ruchao Fan, et al.
Abstract
Speech language models (Speech LMs) enable end-to-end speech-text modeling within a single model, offering a promising direction for spoken dialogue systems. The choice of joint speech-text decoding paradigm plays a critical role in performance, efficiency, and alignment quality. In this work, we systematically compare representative joint speech-text decoding strategies, including the interleaved and parallel generation paradigms, under a controlled experimental setup that uses the same base language model, speech tokenizer, and training data. Our results show that the interleaved approach achieves the best alignment. However, it suffers from slow inference because of its long token sequences. To address this, we propose a novel early-stop interleaved (ESI) pattern that not only significantly accelerates decoding but also yields slightly better performance. Additionally, we curate high-quality question answering (QA) datasets to further improve speech QA performance.
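The sequence-length argument in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: the chunk sizes, the `<pad>` symbol, the token names, and the functions `interleave` and `parallel` are all hypothetical, and the ESI pattern is not modeled because the abstract does not describe its layout. The sketch only shows why a single interleaved stream needs more decoding steps than a parallel one for the same content.

```python
from itertools import zip_longest

PAD = "<pad>"  # hypothetical padding symbol, not taken from the paper


def interleave(text, speech, text_chunk=2, speech_chunk=4):
    """Build one flat sequence by alternating text and speech chunks.

    The chunk sizes are illustrative assumptions. Every token occupies
    its own decoding step, so the length is len(text) + len(speech).
    """
    seq = []
    t = s = 0
    while t < len(text) or s < len(speech):
        seq.extend(text[t:t + text_chunk])
        t += text_chunk
        seq.extend(speech[s:s + speech_chunk])
        s += speech_chunk
    return seq


def parallel(text, speech):
    """Emit one (text, speech) pair per decoding step.

    The shorter stream is padded, so the number of steps is
    max(len(text), len(speech)).
    """
    return list(zip_longest(text, speech, fillvalue=PAD))


text = [f"t{i}" for i in range(4)]       # 4 toy text tokens
speech = [f"s{i}" for i in range(12)]    # 12 toy speech tokens

inter = interleave(text, speech)
par = parallel(text, speech)

print(len(inter))  # 16 decoding steps
print(len(par))    # 12 decoding steps
```

With these toy inputs the interleaved stream takes 16 steps against 12 for the parallel layout; the gap grows with the number of text tokens that must share the single stream.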
Related papers
- What Makes A Good Speech Tokenizer For LLM-centric Speech Generation? A Systematic Study (2025) · 0.00
- On Decoder-only Architecture For Speech-to-text And Large Language Model Integration (2023) · 0.00
- TASTE: Text-aligned Speech Tokenization And Embedding For Spoken Language Modeling (2025) · 0.00
- PSLM: Parallel Generation Of Text And Speech With LLMs For Low-latency Spoken Dialogue Systems (2024) · 2.26
- Investigating Decoder-only Large Language Models For Speech-to-text Translation (2024) · 0.00
- SpeechLM: Enhanced Speech Pre-training With Unpaired Textual Data (2022) · 0.00
- Delayed Fusion: Integrating Large Language Models Into First-pass Decoding In End-to-end Speech Recognition (2025) · 5.84
- Optimizing Alignment Of Speech And Language Latent Spaces For End-to-end Speech Recognition And Understanding (2021) · 9.03