SyncDiff: Diffusion-based Talking Head Synthesis With Bottlenecked Temporal Visual Prior For Improved Synchronization
2025 · Xulin Fan, Heting Gao, Ziyi Chen, et al.
Abstract
Talking head synthesis, also known as speech-to-lip synthesis, reconstructs the facial motions that align with a given audio track. The synthesized videos are evaluated mainly on two aspects: lip-speech synchronization and image fidelity. Recent studies demonstrate that GAN-based and diffusion-based models achieve state-of-the-art (SOTA) performance on this task, with diffusion-based models achieving superior image fidelity but lower synchronization than their GAN-based counterparts. To this end, we propose SyncDiff, a simple yet effective approach that improves diffusion-based models by using a temporal pose frame with an information bottleneck and facial-informative audio features extracted from AVHuBERT as conditioning inputs to the diffusion process. We evaluate SyncDiff on two canonical talking head datasets, LRS2 and LRS3, for direct comparison with other SOTA models. Experiments on LRS2/LRS3 datasets show that SyncDiff achieves a synchronization score 27.7%/62.3%
Related papers
- Diffspeaker: Speech-driven 3D Facial Animation With Diffusion Transformer (2024)
- Said: Speech-driven Blendshape Facial Animation With Diffusion (2023)
- Facediffuser: Speech-driven 3D Facial Animation Synthesis Using Diffusion (2023)
- Lpips-attnwav2lip: Generic Audio-driven Lip Synchronization For Talking Head Generation In The Wild (2026)
- Ksdiff: Keyframe-augmented Speech-aware Dual-path Diffusion For Facial Animation (2025)
- Diff-foley: Synchronized Video-to-audio Synthesis With Latent Diffusion Models (2023)
- Edityourself: Audio-driven Generation And Manipulation Of Talking Head Videos With Diffusion Transformers (2026)
- Emotivetalk: Expressive Talking Head Generation Through Audio Information Decoupling And Emotional Video Diffusion (2024)