SLM-S2ST: A Multimodal Language Model For Direct Speech-to-speech Translation
2025 · Yuxuan Hu, Haibin Wu, Ruchao Fan, et al.
Abstract
Speech-aware language models (LMs) have demonstrated capabilities in understanding spoken language while generating text-based responses. However, enabling them to produce speech output efficiently and effectively remains a challenge. In this paper, we present SLM-S2ST, a multimodal LM for direct speech-to-speech translation (S2ST), built on the open-source Phi4-MM model. SLM-S2ST extends its predecessor by generating translated speech using an audio transformer head that predicts audio tokens with a delay relative to text tokens, followed by a streaming vocoder for waveform synthesis. Our experimental results on the CVSS-C dataset show that SLM-S2ST significantly surpasses existing baseline models trained on the same dataset. Furthermore, when we scale up the training data and the model size, SLM-S2ST performs on par with the current state-of-the-art (SOTA) model.
Related papers
- MSLM-S2ST: A Multitask Speech Language Model For Textless Speech-to-speech Translation With Speaker Style Preservation (2024) · 0.00
- SimulS2S-LLM: Unlocking Simultaneous Inference Of Speech LLMs For Speech-to-speech Translation (2025) · 3.58
- SeamlessExpressiveLM: Speech Language Model For Expressive Speech-to-speech Translation With Chain-of-thought (2024) · 0.00
- S2ST-Omni: Hierarchical Language-aware SpeechLLM Adaptation For Multilingual Speech-to-speech Translation (2025) · 0.00
- Joint Pre-training With Speech And Bilingual Text For Direct Speech To Speech Translation (2022) · 7.81
- StreamSpeech: Simultaneous Speech-to-speech Translation With Multi-task Learning (2024) · 7.81
- Direct Speech-to-speech Translation With Discrete Units (2021) · 13.97
- A Unit-based System And Dataset For Expressive Direct Speech-to-speech Translation (2025) · 2.26