SLIDE: Integrating Speech Language Model With LLM For Spontaneous Spoken Dialogue Generation
2025 · Haitian Lu, Gaofeng Cheng, Liuping Luo, et al.
Abstract
Recently, "textless" speech language models (SLMs) based on speech units have made huge progress in generating naturalistic speech, including non-verbal vocalizations. However, the generated speech samples often lack semantic coherence. In this paper, we propose SLM and LLM Integration for spontaneous spoken Dialogue gEneration (SLIDE). Specifically, we first utilize an LLM to generate the textual content of spoken dialogue. Next, we convert the textual dialogues into phoneme sequences and use a two-tower transformer-based duration predictor to predict the duration of each phoneme. Finally, an SLM conditioned on the spoken phoneme sequences is used to vocalize the textual dialogue. Experimental results on the Fisher dataset demonstrate that our system can generate naturalistic spoken dialogue while maintaining high semantic coherence.
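The data flow described in the abstract (dialogue text, then phonemes, then per-phoneme durations, then a conditioning sequence for the SLM) can be illustrated with a minimal Python sketch. Everything in it is a hypothetical stand-in rather than the authors' implementation: the fixed dialogue replaces the LLM, `TOY_LEXICON` replaces a real grapheme-to-phoneme step, `predict_durations` is a hand-written rule in place of the two-tower transformer, and repeating each phoneme once per predicted frame is only one assumed way of forming a "spoken phoneme sequence". The SLM itself is not modeled.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical toy lexicon; a real system would use a proper G2P tool.
TOY_LEXICON: Dict[str, List[str]] = {
    "hi": ["HH", "AY"],
    "there": ["DH", "EH", "R"],
}


@dataclass
class Turn:
    speaker: str
    text: str


def generate_dialogue_text() -> List[Turn]:
    """Stage 1 stand-in: an LLM would write the dialogue; here it is fixed."""
    return [Turn("A", "hi there"), Turn("B", "hi")]


def text_to_phonemes(text: str) -> List[str]:
    """Map each word to phonemes, using UNK for words missing from the lexicon."""
    phonemes: List[str] = []
    for word in text.lower().split():
        phonemes.extend(TOY_LEXICON.get(word, ["UNK"]))
    return phonemes


def predict_durations(phonemes: List[str]) -> List[int]:
    """Stage 2 stand-in: placeholder rule (vowels 3 frames, others 2),
    not the two-tower transformer predictor from the paper."""
    vowels = {"AY", "EH"}
    return [3 if p in vowels else 2 for p in phonemes]


def expand_by_duration(phonemes: List[str], durations: List[int]) -> List[str]:
    """Repeat each phoneme once per predicted frame (assumed representation)."""
    if len(phonemes) != len(durations):
        raise ValueError("phonemes and durations must have the same length")
    frames: List[str] = []
    for phoneme, duration in zip(phonemes, durations):
        frames.extend([phoneme] * duration)
    return frames


def build_conditioning(turns: List[Turn]) -> List[Tuple[str, List[str]]]:
    """Produce the per-turn sequence that would condition the SLM (stage 3)."""
    result = []
    for turn in turns:
        phonemes = text_to_phonemes(turn.text)
        durations = predict_durations(phonemes)
        result.append((turn.speaker, expand_by_duration(phonemes, durations)))
    return result
```

Calling `build_conditioning(generate_dialogue_text())` returns one `(speaker, frames)` pair per turn. In a real system the frame sequence would be passed to the SLM for vocalization rather than inspected directly.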
Related papers
- PSLM: Parallel Generation Of Text And Speech With LLMs For Low-latency Spoken Dialogue Systems (2024)
- Align-SLM: Textless Spoken Language Models With Reinforcement Learning From AI Feedback (2024)
- Paralinguistics-aware Speech-empowered Large Language Models For Natural Conversation (2024)
- Paralinguistics-enhanced Large Language Modeling Of Spoken Dialogue (2023)
- SLM-S2ST: A Multimodal Language Model For Direct Speech-to-speech Translation (2025)
- Recent Advances In Speech Language Models: A Survey (2024)
- Get Large Language Models Ready To Speak: A Late-fusion Approach For Speech Generation (2024)
- Integrating Pretrained ASR And LM To Perform Sequence Generation For Spoken Language Understanding (2023)