DiffSpeaker: Speech-driven 3D Facial Animation With Diffusion Transformer
2024 · Zhiyuan Ma, Xiangyu Zhu, Guojun Qi, et al.
Abstract
Speech-driven 3D facial animation is important for many multimedia applications. Recent work has shown promise in using either Diffusion models or Transformer architectures for this task. However, their mere aggregation does not lead to improved performance. We suspect this is due to a shortage of paired audio-4D data, which is crucial for the Transformer to effectively perform as a denoiser within the Diffusion framework. To tackle this issue, we present DiffSpeaker, a Transformer-based network equipped with novel biased conditional attention modules. These modules serve as substitutes for the traditional self/cross-attention in standard Transformers, incorporating thoughtfully designed biases that steer the attention mechanisms to concentrate on both the relevant task-specific and diffusion-related conditions. We also explore the trade-off between accurate lip synchronization and non-verbal facial expressions within the Diffusion paradigm. Experiments show our model not only achieves
Related papers
- FaceDiffuser: Speech-driven 3D Facial Animation Synthesis Using Diffusion (2023) · 13.79
- DiffusionTalker: Efficient And Compact Speech-driven 3D Talking Head Via Personalizer-guided Distillation (2025) · 5.05
- SAiD: Speech-driven Blendshape Facial Animation With Diffusion (2023) · 0.00
- DF-3DFace: One-to-many Speech Synchronized 3D Face Animation With Diffusion (2023) · 0.00
- KSDiff: Keyframe-augmented Speech-aware Dual-path Diffusion For Facial Animation (2025) · 0.00
- SyncDiff: Diffusion-based Talking Head Synthesis With Bottlenecked Temporal Visual Prior For Improved Synchronization (2025) · 4.52
- DiffMotion: Speech-driven Gesture Synthesis Using Denoising Diffusion Model (2023) · 9.59
- DiffSHEG: A Diffusion-based Approach For Real-time Speech-driven Holistic 3D Expression And Gesture Generation (2024) · 0.00