HM-Talker: Hybrid Motion Modeling for High-Fidelity Talking Head Synthesis

Abstract

arXiv:2508.10566v3

Audio-driven talking head generation faces a fundamental trade-off between personalization and generalization, which limits its practical application. Implicit models often achieve generalization at the cost of structural incoherence, resulting in unstable head motion and inaccurate lip synchronization. Explicit methods incorporate geometric and anatomical priors such as 3D Morphable Models (3DMMs), which parameterize facial geometry, or Action Units (AUs), which code facial muscle movements, but they tend to produce overly neutral expressions or suffer from limited generalization. To resolve this conflict, we present HM-Talker, an audio-driven talking head framework that integrates explicit articulatory cues with implicit prosodic features to characterize identity-specific dynamics while enabling audio-driven generalization. It has two key components: (i) the Cross-Modal Mapping Module (CMMM), which extracts a comprehensive vocabulary of motion cues from audio and video, and (ii) the Hybrid Motion Modeling Module (HMMM), which employs a Stochastic Feature Pairing (SFP) strategy to dynamically merge paired implicit and explicit features for motion synthesis. This design enables iterative optimization of lower-face motion, alternating between identity-specific and identity-agnostic (audio-only) objectives. Extensive experiments demonstrate that HM-Talker outperforms state-of-the-art methods in both visual realism and lip-sync accuracy across diverse settings.
