Abstract
arXiv:2412.07333v2

Pose-Guided Person Image Synthesis (PGPIS) aims to generate human images in specified poses while preserving the identity and appearance of a source image. This technology supports diverse applications, including virtual try-on, digital avatars, animation, and sign language generation. Although recent diffusion-based PGPIS models produce high-quality results, they typically depend on implicit feature aggregation within the denoising process. As a result, fine-grained texture preservation is limited, and even for the same identity it is difficult to ensure consistent generation under variations in pose and source appearance. To address these limitations, we propose Fusion Embedding for PGPIS using a Diffusion Model (FPDM), the first framework that explicitly aligns fused source-pose embeddings with target image embeddings via contrastive learning and then uses the learned fusion embedding as a conditioning signal for generation. FPDM integrates an Image-Pose Fusion (IPF) module into our proposed Source-Enhanced Pose Fusion approach to learn a fusion embedding aligned with the target image. We then employ a conditional diffusion model guided by the source appearance, the target pose, and the learned fusion embedding. Experiments on the DeepFashion benchmark and the RWTH-PHOENIX-Weather 2014T dataset demonstrate competitive performance compared to existing methods in both quantitative and qualitative evaluations, and ablation studies confirm that explicit fusion embedding alignment substantially improves texture fidelity and consistency across pose and source appearance variations.
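The contrastive alignment between fused source-pose embeddings and target image embeddings can be illustrated with a minimal sketch. The abstract does not state the exact objective, so everything below is an assumption made for illustration: a symmetric InfoNCE (CLIP-style) loss, a temperature of 0.07, and the hypothetical function name `contrastive_alignment_loss`. In this sketch, row i of `fusion_emb` stands for the output of a fusion module for sample i, and row i of `target_emb` stands for the embedding of the matching target image; the other rows in the batch act as negatives.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def contrastive_alignment_loss(fusion_emb, target_emb, temperature=0.07):
    """Symmetric InfoNCE loss (an assumed objective, not the paper's stated one).

    fusion_emb: (batch, dim) fused source-pose embeddings.
    target_emb: (batch, dim) target image embeddings; row i pairs with row i.
    """
    f = l2_normalize(fusion_emb)
    t = l2_normalize(target_emb)
    # Pairwise cosine similarities, sharpened by the temperature.
    logits = f @ t.T / temperature
    idx = np.arange(len(f))
    # Fusion -> target: each fusion row should pick out its own target.
    loss_f2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Target -> fusion: each target column should pick out its own fusion row.
    loss_t2f = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_f2t + loss_t2f)
```

Minimizing a loss of this kind pulls each fusion embedding toward its own target embedding and pushes it away from the other targets in the batch. That is one plausible way to obtain a fusion embedding that could then be passed to a conditional diffusion model as an extra conditioning signal.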