EmotiveTalk: Expressive Talking Head Generation Through Audio Information Decoupling And Emotional Video Diffusion
2024 · Haotian Wang, Yuzhe Weng, Yueyan Li, et al.
Abstract
Diffusion models have revolutionized the field of talking head generation, yet still face challenges in expressiveness, controllability, and stability in long-time generation. In this research, we propose an EmotiveTalk framework to address these issues. Firstly, to realize better control over the generation of lip movement and facial expression, a Vision-guided Audio Information Decoupling (V-AID) approach is designed to generate audio-based decoupled representations aligned with lip movements and expression. Specifically, to achieve alignment between audio and facial expression representation spaces, we present a Diffusion-based Co-speech Temporal Expansion (Di-CTE) module within V-AID to generate expression-related representations under multi-source emotion condition constraints. Then we propose a well-designed Emotional Talking Head Diffusion (ETHD) backbone to efficiently generate highly expressive talking head videos, which contains an Expression Decoupling Injection (EDI) module
Related papers
- EmoGene: Audio-driven Emotional 3D Talking-head Generation (2024) 2.26
- EditYourself: Audio-driven Generation And Manipulation Of Talking Head Videos With Diffusion Transformers (2026) 0.00
- DiffusionTalker: Efficient And Compact Speech-driven 3D Talking Head Via Personalizer-guided Distillation (2025) 5.05
- Controllable Expressive 3D Facial Animation Via Diffusion In A Unified Multimodal Space (2025) 0.00
- REST: Diffusion-based Real-time End-to-end Streaming Talking Head Generation Via ID-context Caching And Asynchronous Streaming Distillation (2025) 0.00
- Speech Driven Talking Face Generation From A Single Image And An Emotion Condition (2020) 0.00
- ProbTalk3D: Non-deterministic Emotion Controllable Speech-driven 3D Facial Animation Synthesis Using VQ-VAE (2024) 11.53
- DiffSHEG: A Diffusion-based Approach For Real-time Speech-driven Holistic 3D Expression And Gesture Generation (2024) 0.00