Gelina: Unified Speech And Gesture Synthesis Via Interleaved Token Prediction

Abstract

arXiv:2510.12834v4

Human communication is multimodal, with speech and gestures tightly coupled, yet most computational methods synthesize the two sequentially, which weakens synchrony and prosody alignment. We introduce Gelina, a unified framework that jointly synthesizes speech and co-speech gestures from text, using interleaved token sequences in a discrete autoregressive backbone with modality-specific decoders. Gelina supports multi-speaker and multi-style cloning and enables gesture-only synthesis from speech inputs. Subjective and objective evaluations demonstrate competitive speech quality and improved gesture generation over unimodal baselines.
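
To make the idea of an interleaved token sequence concrete, the following is a minimal sketch, not Gelina's actual implementation. The abstract does not specify the interleaving pattern, so the fixed chunk sizes (`speech_chunk`, `gesture_chunk`), the modality tags, and the function names `interleave` and `deinterleave` are all assumptions made for illustration. The sketch merges two discrete token streams into a single sequence that one autoregressive model could predict, then splits it back so each stream could be routed to its own modality-specific decoder.

```python
from typing import List, Tuple

# Hypothetical modality tags; the real tokenization scheme is not given in the abstract.
SPEECH, GESTURE = "S", "G"

Token = Tuple[str, int]


def interleave(speech: List[int], gesture: List[int],
               speech_chunk: int = 2, gesture_chunk: int = 1) -> List[Token]:
    """Merge two token streams into one sequence, alternating fixed-size chunks.

    The chunk sizes are an assumption; they stand in for whatever ratio
    aligns the frame rates of the two modalities.
    """
    out: List[Token] = []
    s = g = 0
    while s < len(speech) or g < len(gesture):
        # Emit the next chunk of speech tokens (may be empty once exhausted).
        out.extend((SPEECH, tok) for tok in speech[s:s + speech_chunk])
        s += speech_chunk
        # Emit the next chunk of gesture tokens.
        out.extend((GESTURE, tok) for tok in gesture[g:g + gesture_chunk])
        g += gesture_chunk
    return out


def deinterleave(seq: List[Token]) -> Tuple[List[int], List[int]]:
    """Split an interleaved sequence back into per-modality streams."""
    speech = [tok for tag, tok in seq if tag == SPEECH]
    gesture = [tok for tag, tok in seq if tag == GESTURE]
    return speech, gesture


if __name__ == "__main__":
    mixed = interleave([1, 2, 3, 4], [10, 20])
    print(mixed)
    print(deinterleave(mixed))
```

Because both modalities share one sequence, each predicted gesture token can condition on the speech tokens that precede it and vice versa, which is the intuition behind generating the two jointly rather than one after the other.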
