DiM-Gestures: Co-speech Gesture Generation With Adaptive Layer Normalization Mamba-2 Framework
2024 · Fan Zhang, Naye Ji, Fuxing Gao, et al.
Abstract
Speech-driven gesture generation is an emerging domain within virtual human creation, where current methods predominantly utilize Transformer-based architectures that necessitate extensive memory and are characterized by slow inference speeds. In response to these limitations, we propose DiM-Gestures, a novel end-to-end generative model crafted to create highly personalized 3D full-body gestures solely from raw speech audio, employing Mamba-based architectures. This model integrates a Mamba-based fuzzy feature extractor with a non-autoregressive Adaptive Layer Normalization (AdaLN) Mamba-2 diffusion architecture. The extractor, leveraging a Mamba framework and a WavLM pre-trained model, autonomously derives implicit, continuous fuzzy features, which are then unified into a singular latent feature. This feature is processed by the AdaLN Mamba-2, which implements a uniform conditional mechanism across all tokens to robustly model the interplay between the fuzzy features and th
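The "uniform conditional mechanism across all tokens" mentioned in the abstract can be illustrated with a minimal sketch of adaptive layer normalization. This is not the authors' implementation: the function names, the weight layout, and the `1 + scale` modulation (a convention borrowed from other AdaLN-style diffusion models) are assumptions made for illustration, and the Mamba-2 block that would consume the modulated tokens is omitted entirely.

```python
import math

def layer_norm(x, eps=1e-5):
    # Normalize one token vector to zero mean and (near) unit variance.
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def linear(W, b, c):
    # Plain affine map: one output per row of W.
    return [sum(w * ci for w, ci in zip(row, c)) + bi for row, bi in zip(W, b)]

def adaln(tokens, cond, W_scale, b_scale, W_shift, b_shift):
    # Hypothetical AdaLN: scale and shift are regressed once from the
    # condition vector and then applied identically to every token.
    scale = linear(W_scale, b_scale, cond)
    shift = linear(W_shift, b_shift, cond)
    out = []
    for tok in tokens:
        normed = layer_norm(tok)
        # Assumed modulation form: normed * (1 + scale) + shift.
        out.append([n * (1.0 + s) + t for n, s, t in zip(normed, scale, shift)])
    return out

# Toy usage: two 3-dimensional tokens and a 2-dimensional condition vector.
tokens = [[1.0, 2.0, 3.0], [4.0, 6.0, 8.0]]
cond = [0.5, -0.5]
zeros_W = [[0.0, 0.0] for _ in range(3)]
zeros_b = [0.0, 0.0, 0.0]
plain = adaln(tokens, cond, zeros_W, zeros_b, zeros_W, zeros_b)
shifted = adaln(tokens, cond, zeros_W, zeros_b, zeros_W, [1.0, 1.0, 1.0])
```

With all-zero modulation weights the sketch reduces to ordinary layer normalization, and a constant shift moves every token by the same amount, which is the sense in which the conditioning is uniform across tokens.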
Related papers
- A Conversational Gesture Synthesis System Based On Emotions And Semantics (2025) 0.00
- Audio Is All In One: Speech-driven Gesture Synthetics Using WavLM Pre-trained Model (2023) 0.00
- DiffMotion: Speech-driven Gesture Synthesis Using Denoising Diffusion Model (2023) 9.59
- EmotionGesture: Audio-driven Diverse Emotional Co-speech 3D Gesture Generation (2023) 10.97
- FreeTalker: Controllable Speech And Text-driven Gesture Generation Based On Diffusion Models For Enhanced Speaker Naturalness (2024) 9.59
- ExpGest: Expressive Speaker Generation Using Diffusion Model And Hybrid Audio-text Guidance (2024) 4.52
- Diffusion-based Co-speech Gesture Generation Using Joint Text And Audio Representation (2023) 10.07
- Speech2AffectiveGestures: Synthesizing Co-speech Gestures With Generative Adversarial Affective Expression Learning (2021) 14.35