Speech-to-speech Translation With Discrete-unit-based Style Transfer
2023 · Yongqi Wang, Jionghao Bai, Rongjie Huang, et al.
Abstract
Direct speech-to-speech translation (S2ST) with discrete self-supervised representations has achieved remarkable accuracy, but is unable to preserve the speaker timbre of the source speech. Meanwhile, the scarcity of high-quality speaker-parallel data poses a challenge for learning style transfer during translation. We design an S2ST pipeline with style-transfer capability on the basis of discrete self-supervised speech representations and codec units. The acoustic language model we introduce for style transfer leverages self-supervised in-context learning, acquiring style transfer ability without relying on any speaker-parallel data, thereby overcoming data scarcity. By using extensive training data, our model achieves zero-shot cross-lingual style transfer on previously unseen source languages. Experiments show that our model generates translated speech with high fidelity and speaker similarity. Audio samples are available at http://stylelm.github.io/ .
Related papers
- StyleS2ST: Zero-shot Style Transfer For Direct Speech-to-speech Translation (2023) · 0.00
- TranSpeech: Speech-to-speech Translation With Bilateral Perturbation (2022) · 0.00
- Direct Speech-to-speech Translation With Discrete Units (2021) · 13.97
- MSLM-S2ST: A Multitask Speech Language Model For Textless Speech-to-speech Translation With Speaker Style Preservation (2024) · 0.00
- A Unit-based System And Dataset For Expressive Direct Speech-to-speech Translation (2025) · 2.26
- Enhanced Direct Speech-to-speech Translation Using Self-supervised Pre-training And Data Augmentation (2022) · 10.85
- Joint Pre-training With Speech And Bilingual Text For Direct Speech To Speech Translation (2022) · 7.81
- SLM-S2ST: A Multimodal Language Model For Direct Speech-to-speech Translation (2025) · 0.00