Hallo3: Highly Dynamic And Realistic Portrait Image Animation With Video Diffusion Transformer
2024 · Jiahao Cui, Hui Li, Yun Zhan, et al.
Abstract
Existing methodologies for animating portrait images face significant challenges, particularly in handling non-frontal perspectives, rendering dynamic objects around the portrait, and generating immersive, realistic backgrounds. In this paper, we introduce the first application of a pretrained transformer-based video generative model that demonstrates strong generalization capabilities and generates highly dynamic, realistic videos for portrait animation, effectively addressing these challenges. The adoption of a new video backbone model makes previous U-Net-based methods for identity maintenance, audio conditioning, and video extrapolation inapplicable. To address this limitation, we design an identity reference network consisting of a causal 3D VAE combined with a stacked series of transformer layers, ensuring consistent facial identity across video sequences. Additionally, we investigate various speech audio conditioning and motion frame mechanisms to enable the generation of contin
Related papers
- FLOAT: Generative Motion Latent Flow Matching For Audio-driven Talking Portrait (2024) 0.00
- See The Speaker: Crafting High-resolution Talking Faces From Speech With Prior Guidance And Region Refinement (2025) 0.00
- EmoGene: Audio-driven Emotional 3D Talking-head Generation (2024) 2.26
- Edityourself: Audio-driven Generation And Manipulation Of Talking Head Videos With Diffusion Transformers (2026) 0.00
- SAiD: Speech-driven Blendshape Facial Animation With Diffusion (2023) 0.00
- 3mdit: Unified Tri-modal Diffusion Transformer For Text-driven Synchronized Audio-video Generation (2025) 0.00
- Transformer-S2A: Robust And Efficient Speech-to-animation (2021) 8.35
- DiffSpeaker: Speech-driven 3D Facial Animation With Diffusion Transformer (2024) 5.24