SeamlessExpressiveLM: Speech Language Model For Expressive Speech-to-speech Translation With Chain-of-thought
2024 · Hongyu Gong, Bandhav Veluri
Abstract
Expressive speech-to-speech translation (S2ST) is a key research topic in seamless communication, which focuses on preserving semantics and speaker vocal style in translated speech. Early works synthesized speech aligned in speaker style in order to directly learn the mapping from source speech to the target speech spectrogram. Without relying on style-aligned data, recent studies leverage advances in language modeling (LM) and build cascaded LMs on semantic and acoustic tokens. This work proposes SeamlessExpressiveLM, a single speech language model for expressive S2ST. We decompose the complex source-to-target speech mapping into intermediate generation steps with chain-of-thought prompting. The model is first guided to translate the target semantic content and then to transfer the speaker style to multi-stream acoustic units. Evaluated on Spanish-to-English and Hungarian-to-English translations, SeamlessExpressiveLM outperforms cascaded LMs in both semantic quality and style transfer, meanw
Related papers
- MSLM-S2ST: A Multitask Speech Language Model For Textless Speech-to-speech Translation With Speaker Style Preservation (2024)
- SLM-S2ST: A Multimodal Language Model For Direct Speech-to-speech Translation (2025)
- Speech-to-speech Translation With Discrete-unit-based Style Transfer (2023)
- SimulS2S-LLM: Unlocking Simultaneous Inference Of Speech LLMs For Speech-to-speech Translation (2025)
- StyleS2ST: Zero-shot Style Transfer For Direct Speech-to-speech Translation (2023)
- A Unit-based System And Dataset For Expressive Direct Speech-to-speech Translation (2025)
- Language Translation, And Change Of Accent For Speech-to-speech Task Using Diffusion Model (2025)
- Chain-of-thought Prompting For Speech Translation (2024)