RosettaSpeech: Zero-shot Speech-to-speech Translation Without Parallel Speech
2025 · Zhisheng Zheng, Xiaohang Sun, Tuan Dinh, et al.
Abstract
End-to-end speech-to-speech translation (S2ST) systems typically struggle with a critical data bottleneck: the scarcity of parallel speech-to-speech corpora. To overcome this, we introduce RosettaSpeech, a novel zero-shot framework trained exclusively on monolingual speech-text data augmented by machine translation supervision. Unlike prior work that relies on complex cascaded pseudo-labeling, our approach uses text as a semantic bridge during training to synthesize translation targets, eliminating the need for parallel speech pairs while maintaining a direct, end-to-end inference pipeline. Empirical evaluations on the CVSS-C benchmark demonstrate that RosettaSpeech achieves state-of-the-art zero-shot performance, surpassing leading baselines by significant margins, with ASR-BLEU scores of 25.17 for German-to-English (+27% relative gain) and 29.86 for Spanish-to-English (+14% relative gain). Crucially, our model effectively preserves the source speaker's voice without