Gelina: Unified Speech And Gesture Synthesis Via Interleaved Token Prediction
2026 · Teo Guichoux, Theodor Lemerle, Shivam Mehta, et al.
Abstract
arXiv:2510.12834v4
Human communication is multimodal, with speech and gestures tightly coupled, yet most computational methods for generating speech and gestures synthesize them sequentially, weakening synchrony and prosody alignment. We introduce Gelina, a unified framework that jointly synthesizes speech and co-speech gestures from text using interleaved token sequences in a discrete autoregressive backbone, with modality-specific decoders. Gelina supports multi-speaker and multi-style cloning and enables gesture-only synthesis from speech inputs. Subjective and objective evaluations demonstrate competitive speech quality and improved gesture generation over unimodal baselines.
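The abstract does not spell out how the two token streams are interleaved, so the following is only a minimal sketch of the general idea, not the authors' scheme. It assumes a fixed ratio of speech tokens per gesture token and a shared vocabulary in which gesture ids are shifted past the speech ids; the function names `interleave` and `split`, the ratio, and the vocabulary size are all hypothetical.

```python
from typing import List, Tuple


def interleave(speech: List[int], gesture: List[int],
               speech_per_gesture: int, speech_vocab_size: int) -> List[int]:
    """Merge two discrete token streams into one sequence.

    Assumption (not from the paper): each gesture token follows a block of
    `speech_per_gesture` speech tokens, and gesture ids are offset by
    `speech_vocab_size` so the two modalities never share an id.
    """
    merged: List[int] = []
    pos = 0
    for g in gesture:
        merged.extend(speech[pos:pos + speech_per_gesture])
        pos += speech_per_gesture
        merged.append(g + speech_vocab_size)
    merged.extend(speech[pos:])  # leftover speech tokens, if any
    return merged


def split(merged: List[int], speech_vocab_size: int) -> Tuple[List[int], List[int]]:
    """Route each token back to its modality, e.g. before modality-specific decoding."""
    speech = [t for t in merged if t < speech_vocab_size]
    gesture = [t - speech_vocab_size for t in merged if t >= speech_vocab_size]
    return speech, gesture


sequence = interleave([1, 2, 3, 4, 5], [0, 1],
                      speech_per_gesture=2, speech_vocab_size=10)
print(sequence)             # [1, 2, 10, 3, 4, 11, 5]
print(split(sequence, 10))  # ([1, 2, 3, 4, 5], [0, 1])
```

A single autoregressive model trained on sequences like `sequence` predicts both modalities in one pass, which is the property the abstract credits for better synchrony than generating speech and gestures one after the other.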
Related papers
- A Conversational Gesture Synthesis System Based On Emotions And Semantics (2025) · 0.00
- Unified Speech And Gesture Synthesis Using Flow Matching (2023) · 5.24
- Fake It To Make It: Using Synthetic Data To Remedy The Data Shortage In Joint Multimodal Speech-and-gesture Synthesis (2024) · 6.34
- Diffusion-based Co-speech Gesture Generation Using Joint Text And Audio Representation (2023) · 10.07
- DiM-Gesture: Co-speech Gesture Generation With Adaptive Layer Normalization Mamba-2 Framework (2024) · 2.26
- Speech2AffectiveGestures: Synthesizing Co-speech Gestures With Generative Adversarial Affective Expression Learning (2021) · 14.35
- EmotionGesture: Audio-driven Diverse Emotional Co-speech 3D Gesture Generation (2023) · 10.97
- GELP: GAN-excited Linear Prediction For Speech Synthesis From Mel-spectrogram (2019) · 10.74