Phonology-guided Speech-to-speech Translation For African Languages
2024 · Peter Ochieng, Dennis Kaburu
Abstract
We present a prosody-guided framework for speech-to-speech translation (S2ST) that aligns and translates speech *without* transcripts by leveraging cross-linguistic pause synchrony. Analyzing a 6,000-hour East African news corpus spanning five languages, we show that *within-phylum* language pairs exhibit 30–40% lower pause variance and over 3\(\times\) higher onset/offset correlation than cross-phylum pairs. These findings motivate **SPaDA**, a dynamic-programming alignment algorithm that integrates silence consistency, rate synchrony, and semantic similarity. SPaDA improves alignment \(F_1\) by 3–4 points and eliminates up to 38% of spurious matches relative to greedy VAD baselines. Using SPaDA-aligned segments, we train **SegUniDiff**, a diffusion-based S2ST model guided by *external gradients* from frozen semantic and speaker encoders. SegUniDiff matches an enhanced cascade in BLEU (30.3 on CVSS-C vs. 28.9 for UnitY), reduces speaker error rate (EER) fr
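To make the alignment idea concrete, here is a minimal Python sketch of a monotonic dynamic-programming aligner that scores segment pairs by the three cues the abstract names: silence consistency, rate synchrony, and semantic similarity. It is not the authors' SPaDA implementation. The `Segment` fields, the equal weights, the relative-difference scoring, and the `threshold` value are all assumptions chosen for illustration, and the embeddings stand in for the output of any frozen semantic encoder.

```python
import math
from dataclasses import dataclass


@dataclass
class Segment:
    pause: float   # silence following the segment, in seconds (assumed feature)
    rate: float    # speaking rate, e.g. syllables per second (assumed feature)
    emb: tuple     # semantic embedding from some frozen encoder (assumed)


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def match_score(s, t, weights=(1.0, 1.0, 1.0)):
    """Weighted mean of the three cues; each cue is at most 1."""
    silence = 1.0 - abs(s.pause - t.pause) / max(s.pause, t.pause, 1e-6)
    rate = 1.0 - abs(s.rate - t.rate) / max(s.rate, t.rate, 1e-6)
    semantic = cosine(s.emb, t.emb)
    w_sil, w_rate, w_sem = weights
    return (w_sil * silence + w_rate * rate + w_sem * semantic) / sum(weights)


def align(src, tgt, threshold=0.5):
    """Return monotonic (src_index, tgt_index) pairs maximizing total gain.

    A pair contributes match_score - threshold, and skipping a segment
    contributes 0, so pairs scoring below the threshold are never selected.
    """
    n, m = len(src), len(tgt)
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    move = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            options = []
            if i > 0:
                options.append((best[i - 1][j], "skip_src"))
            if j > 0:
                options.append((best[i][j - 1], "skip_tgt"))
            if i > 0 and j > 0:
                gain = match_score(src[i - 1], tgt[j - 1]) - threshold
                options.append((best[i - 1][j - 1] + gain, "match"))
            best[i][j], move[i][j] = max(options)

    # Backtrace from the final cell to recover the chosen pairs.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        step = move[i][j]
        if step == "match":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif step == "skip_src":
            i -= 1
        else:
            j -= 1
    return pairs[::-1]


# Toy data: the target stream contains one extra segment (index 1).
src = [
    Segment(0.40, 4.0, (1.0, 0.0, 0.0)),
    Segment(0.80, 5.0, (0.0, 1.0, 0.0)),
    Segment(0.30, 3.5, (0.0, 0.0, 1.0)),
]
tgt = [
    Segment(0.42, 4.1, (0.9, 0.1, 0.0)),
    Segment(0.10, 7.0, (-1.0, 0.0, 0.0)),
    Segment(0.75, 5.2, (0.1, 0.9, 0.0)),
    Segment(0.33, 3.4, (0.0, 0.1, 0.9)),
]
print(align(src, tgt))
```

In this toy input the extra target segment disagrees with every source segment on pause length, rate, and meaning, so the aligner skips it rather than forcing a match. A greedy matcher that pairs segments in order would instead pair it with the second source segment, which is the kind of spurious match the abstract says SPaDA reduces.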
Related papers
- Textless Speech-to-speech Translation With Limited Parallel Data (2023)
- A Unit-based System And Dataset For Expressive Direct Speech-to-speech Translation (2025)
- TranSpeech: Speech-to-speech Translation With Bilateral Perturbation (2022)
- Joint Pre-training With Speech And Bilingual Text For Direct Speech To Speech Translation (2022)
- Leveraging Unsupervised And Weakly-supervised Data To Improve Direct Speech-to-speech Translation (2022)
- SLM-S2ST: A Multimodal Language Model For Direct Speech-to-speech Translation (2025)
- Textless Speech-to-speech Translation On Real Data (2021)
- Language Translation, And Change Of Accent For Speech-to-speech Task Using Diffusion Model (2025)