Abstract

We present a prosody-guided framework for speech-to-speech translation (S2ST) that aligns and translates speech *without* transcripts by leveraging cross-linguistic pause synchrony. Analyzing a 6,000-hour East African news corpus spanning five languages, we show that *within-phylum* language pairs exhibit 30–40% lower pause variance and over 3× higher onset/offset correlation compared to cross-phylum pairs. These findings motivate **SPaDA**, a dynamic-programming alignment algorithm that integrates silence consistency, rate synchrony, and semantic similarity. SPaDA improves alignment F1 by 3–4 points and eliminates up to 38% of spurious matches relative to greedy VAD baselines. Using SPaDA-aligned segments, we train **SegUniDiff**, a diffusion-based S2ST model guided by *external gradients* from frozen semantic and speaker encoders. SegUniDiff matches an enhanced cascade in BLEU (30.3 on CVSS-C vs. 28.9 for UnitY), reduces speaker error rate (EER) fr

Tags

  • Speech Translation
  • Text-to-Speech
  • Speech Recognition
