KSDiff: Keyframe-Augmented Speech-Aware Dual-Path Diffusion for Facial Animation
2025 · Tianle Lyu, Junchuan Zhao, Ye Wang
Abstract
Audio-driven facial animation has made significant progress in multimedia applications, with diffusion models showing strong potential for talking-face synthesis. However, most existing works treat speech features as a monolithic representation and fail to capture their fine-grained roles in driving different facial motions, while also overlooking the importance of modeling keyframes with intense dynamics. To address these limitations, we propose KSDiff, a Keyframe-Augmented Speech-Aware Dual-Path Diffusion framework. Specifically, the raw audio and transcript are processed by a Dual-Path Speech Encoder (DPSE) to disentangle expression-related and head-pose-related features, while an autoregressive Keyframe Establishment Learning (KEL) module predicts the most salient motion frames. These components are integrated into a Dual-path Motion generator to synthesize coherent and realistic facial motions. Extensive experiments on HDTF and VoxCeleb demonstrate that KSDiff achieves state-of-th
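To make the idea of "keyframes with intense dynamics" concrete, here is a minimal sketch that scores each frame by how far its motion coefficients moved from the previous frame and keeps the top-k frames. This is not the KEL module described above, which is a learned autoregressive predictor; it is a hand-written heuristic for illustration only. The function names `frame_dynamics` and `select_keyframes`, the distance measure, and the toy values are all assumptions.

```python
import math


def frame_dynamics(motion):
    """Per-frame motion magnitude: Euclidean distance to the previous frame.

    `motion` is a list of equal-length coefficient vectors, one per frame.
    The first frame has no predecessor, so its score is 0.0.
    """
    scores = [0.0]
    for prev, cur in zip(motion, motion[1:]):
        scores.append(math.dist(prev, cur))
    return scores


def select_keyframes(motion, k):
    """Return the indices of the k frames with the largest dynamics, in time order."""
    scores = frame_dynamics(motion)
    # Rank frame indices by score, highest first (assumed selection rule).
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Toy sequence of 2-D motion coefficients (hypothetical values).
motion = [[0.0, 0.0], [0.1, 0.0], [1.1, 0.0], [1.1, 0.1], [1.1, 2.1]]
keyframes = select_keyframes(motion, k=2)
print(keyframes)
```

In this toy sequence the two largest jumps occur at frames 2 and 4, so those are the frames selected. A learned predictor would instead infer such frames from speech, without access to the ground-truth motion.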
Related papers
- FaceDiffuser: Speech-Driven 3D Facial Animation Synthesis Using Diffusion (2023)
- DF-3DFace: One-to-Many Speech Synchronized 3D Face Animation with Diffusion (2023)
- KeyFace: Expressive Audio-Driven Facial Animation for Long Sequences via Keyframe Interpolation (2025)
- DiffSpeaker: Speech-Driven 3D Facial Animation with Diffusion Transformer (2024)
- SAiD: Speech-Driven Blendshape Facial Animation with Diffusion (2023)
- DiffMotion: Speech-Driven Gesture Synthesis Using Denoising Diffusion Model (2023)
- DiffusionTalker: Efficient and Compact Speech-Driven 3D Talking Head via Personalizer-Guided Distillation (2025)
- DiffSHEG: A Diffusion-Based Approach for Real-Time Speech-Driven Holistic 3D Expression and Gesture Generation (2024)