A Unit-based System And Dataset For Expressive Direct Speech-to-speech Translation
2025 · Anna Min, Chenxu Hu, Yi Ren, et al.
Abstract
Current research in speech-to-speech translation (S2ST) primarily concentrates on translation accuracy and speech naturalness, often overlooking paralinguistic information, which is essential for conveying emotions and attitudes in communication. To address this, our research introduces a novel, carefully curated multilingual dataset drawn from various movie audio tracks. Each pair in the dataset is precisely matched for paralinguistic information and duration. We build on this dataset by integrating multiple prosody transfer techniques, aiming for translations that are accurate, natural-sounding, and rich in paralinguistic detail. Our experimental results confirm that our model retains more paralinguistic information from the source speech while maintaining high standards of translation accuracy and naturalness.