DiffSHEG: A Diffusion-based Approach For Real-time Speech-driven Holistic 3D Expression And Gesture Generation
2024 · Junming Chen, Yunfei Liu, Jianan Wang, et al.
Abstract
We propose DiffSHEG, a Diffusion-based approach for Speech-driven Holistic 3D Expression and Gesture generation of arbitrary length. While previous works focused on co-speech gesture or expression generation individually, the joint generation of synchronized expressions and gestures remains barely explored. To address this, our diffusion-based co-speech motion generation transformer enables uni-directional information flow from expression to gesture, facilitating improved matching of joint expression-gesture distributions. Furthermore, we introduce an outpainting-based sampling strategy for arbitrarily long sequence generation in diffusion models, offering flexibility and computational efficiency. Our method provides a practical solution that produces high-quality synchronized expression and gesture generation driven by speech. Evaluated on two public datasets, our approach achieves state-of-the-art performance both quantitatively and qualitatively. Additionally, a user study confirms
Related papers
- DiffMotion: Speech-driven Gesture Synthesis Using Denoising Diffusion Model (2023) 9.59
- ExpGest: Expressive Speaker Generation Using Diffusion Model And Hybrid Audio-text Guidance (2024) 4.52
- Diffusion-based Co-speech Gesture Generation Using Joint Text And Audio Representation (2023) 10.07
- A Conversational Gesture Synthesis System Based On Emotions And Semantics (2025) 0.00
- FaceDiffuser: Speech-driven 3D Facial Animation Synthesis Using Diffusion (2023) 13.79
- DiffSpeaker: Speech-driven 3D Facial Animation With Diffusion Transformer (2024) 5.24
- FreeTalker: Controllable Speech And Text-driven Gesture Generation Based On Diffusion Models For Enhanced Speaker Naturalness (2024) 9.59
- DF-3DFace: One-to-many Speech Synchronized 3D Face Animation With Diffusion (2023) 0.00