Phonology-guided Speech-to-speech Translation For African Languages

Abstract

We present a prosody-guided framework for speech-to-speech translation (S2ST) that aligns and translates speech *without* transcripts by leveraging cross-linguistic pause synchrony. Analyzing a 6,000-hour East African news corpus spanning five languages, we show that *within-phylum* language pairs exhibit 30–40% lower pause variance and over 3× higher onset/offset correlation compared to cross-phylum pairs. These findings motivate **SPaDA**, a dynamic-programming alignment algorithm that integrates silence consistency, rate synchrony, and semantic similarity. SPaDA improves alignment F1 by +3–4 points and eliminates up to 38% of spurious matches relative to greedy VAD baselines. Using SPaDA-aligned segments, we train **SegUniDiff**, a diffusion-based S2ST model guided by *external gradients* from frozen semantic and speaker encoders. SegUniDiff matches an enhanced cascade in BLEU (30.3 on CVSS-C vs. 28.9 for UnitY), reduces speaker error rate (EER) fr
