A Conversational Gesture Synthesis System Based On Emotions And Semantics
2025 · Thanh Hoang-Minh
Abstract
Along with the explosion of large language models, improvements in speech synthesis, advancements in hardware, and the evolution of computer graphics, the current bottleneck in creating digital humans lies in generating character movements that correspond naturally to text or speech inputs. In this work, we present DeepGesture, a diffusion-based gesture synthesis framework for generating expressive co-speech gestures conditioned on multimodal signals: text, speech, emotion, and seed motion. Built upon the DiffuseStyleGesture model, DeepGesture introduces novel architectural enhancements that improve semantic alignment and emotional expressiveness in generated gestures. Specifically, we integrate fast text transcriptions as semantic conditioning and implement emotion-guided classifier-free diffusion to support controllable gesture generation across affective states. To visualize results, we implement a full rendering pipeline in Unity based on BVH output from the model. Evaluation on
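The emotion-guided classifier-free diffusion mentioned in the abstract can be illustrated with the standard classifier-free guidance rule, in which a noise prediction made without the condition is extrapolated toward one made with it. The sketch below is a minimal illustration under that assumption, not DeepGesture's actual implementation: the names `cfg_combine`, `toy_denoiser`, and `guidance_scale` are hypothetical, and the toy denoiser merely stands in for a trained network.

```python
import numpy as np


def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Standard classifier-free guidance combination (assumed form).

    scale = 0 returns the unconditional prediction, scale = 1 the
    conditional one, and scale > 1 pushes further toward the condition.
    """
    eps_uncond = np.asarray(eps_uncond, dtype=float)
    eps_cond = np.asarray(eps_cond, dtype=float)
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)


def toy_denoiser(x_t, emotion=None):
    """Hypothetical stand-in for a trained noise-prediction network.

    Passing emotion=None plays the role of the null (dropped) condition.
    """
    bias = 0.0 if emotion is None else float(emotion)
    return 0.1 * np.asarray(x_t, dtype=float) + bias


# One guided prediction for a toy pose tensor (4 frames, 3 channels).
x_t = np.zeros((4, 3))
eps_u = toy_denoiser(x_t)               # emotion dropped
eps_c = toy_denoiser(x_t, emotion=1.0)  # emotion supplied
eps = cfg_combine(eps_u, eps_c, guidance_scale=2.5)
```

In this form a single scalar controls how strongly the emotion condition shapes each denoising step, which is what makes the affective state adjustable at sampling time.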
Related papers
- Diffusion-based Co-speech Gesture Generation Using Joint Text And Audio Representation (2023)
- EmotionGesture: Audio-driven Diverse Emotional Co-speech 3D Gesture Generation (2023)
- ExpGest: Expressive Speaker Generation Using Diffusion Model And Hybrid Audio-text Guidance (2024)
- DiffSHEG: A Diffusion-based Approach For Real-time Speech-driven Holistic 3D Expression And Gesture Generation (2024)
- DiffMotion: Speech-driven Gesture Synthesis Using Denoising Diffusion Model (2023)
- Audio Is All In One: Speech-driven Gesture Synthetics Using WavLM Pre-trained Model (2023)
- DiM-Gesture: Co-speech Gesture Generation With Adaptive Layer Normalization Mamba-2 Framework (2024)
- Speech2AffectiveGestures: Synthesizing Co-speech Gestures With Generative Adversarial Affective Expression Learning (2021)