Audio Is All In One: Speech-driven Gesture Synthetics Using WavLM Pre-trained Model
2023 · Fan Zhang, Naye Ji, Fuxing Gao, et al.
Abstract
The generation of co-speech gestures for digital humans is an emerging area in the field of virtual human creation. Prior research has made progress by using acoustic and semantic information as input and adopting classification methods to identify the person's ID and emotion for driving co-speech gesture generation. However, this endeavour still faces significant challenges. These challenges go beyond the intricate interplay between co-speech gestures, speech acoustics, and semantics; they also encompass the complexities associated with personality, emotion, and other obscure but important factors. This paper introduces "diffmotion-v2," a speech-conditional, diffusion-based, non-autoregressive transformer-based generative model built on the WavLM pre-trained model. It can produce individual and stylized full-body co-speech gestures using only raw speech audio, eliminating the need for complex multimodal processing and manual annotation. Firstly, considering that speech audio not only contains acou
Related papers
- DiffMotion: Speech-driven Gesture Synthesis Using Denoising Diffusion Model (2023) · 9.59
- Diffusion-based Co-speech Gesture Generation Using Joint Text And Audio Representation (2023) · 10.07
- A Conversational Gesture Synthesis System Based On Emotions And Semantics (2025) · 0.00
- ExpGest: Expressive Speaker Generation Using Diffusion Model And Hybrid Audio-text Guidance (2024) · 4.52
- EmotionGesture: Audio-driven Diverse Emotional Co-speech 3D Gesture Generation (2023) · 10.97
- FreeTalker: Controllable Speech And Text-driven Gesture Generation Based On Diffusion Models For Enhanced Speaker Naturalness (2024) · 9.59
- DiM-Gesture: Co-speech Gesture Generation With Adaptive Layer Normalization Mamba-2 Framework (2024) · 2.26
- DiffSHEG: A Diffusion-based Approach For Real-time Speech-driven Holistic 3D Expression And Gesture Generation (2024) · 0.00