See The Speaker: Crafting High-resolution Talking Faces From Speech With Prior Guidance And Region Refinement
2025 · Jinting Wang, Jun Wang, Hei Victor Cheng, et al.
Abstract
Unlike existing methods that rely on source images as appearance references and use source speech to generate motion, this work proposes a novel approach that directly extracts information from the speech, addressing key challenges in speech-to-talking face. Specifically, we first employ a speech-to-face portrait generation stage, utilizing a speech-conditioned diffusion model combined with statistical facial prior and a sample-adaptive weighting module to achieve high-quality portrait generation. In the subsequent speech-driven talking face generation stage, we embed expressive dynamics such as lip movement, facial expressions, and eye movements into the latent space of the diffusion model and further optimize lip synchronization using a region-enhancement module. To generate high-resolution outputs, we integrate a pre-trained Transformer-based discrete codebook with an image rendering network, enhancing video frame details in an end-to-end manner. Experimental results demonstrate tha
Related papers
- Speech Driven Talking Face Generation From A Single Image And An Emotion Condition (2020) 0.00
- A Unified Compression Framework For Efficient Speech-driven Talking-face Generation (2023) 0.00
- DiffusionTalker: Efficient And Compact Speech-driven 3D Talking Head Via Personalizer-guided Distillation (2025) 5.05
- From Faces To Voices: Learning Hierarchical Representations For High-quality Video-to-speech (2025) 0.00
- FLOAT: Generative Motion Latent Flow Matching For Audio-driven Talking Portrait (2024) 0.00
- Instruct-NeuralTalker: Editing Audio-driven Talking Radiance Fields With Instructions (2023) 0.00
- SAiD: Speech-driven Blendshape Facial Animation With Diffusion (2023) 0.00
- DiffSpeaker: Speech-driven 3D Facial Animation With Diffusion Transformer (2024) 5.24