ReCoM: Realistic Co-speech Motion Generation With Recurrent Embedded Transformer
2025 · Yong Xie, Yunlian Sun, Hongwen Zhang, et al.
Abstract
We present ReCoM, an efficient framework for generating high-fidelity and generalizable human body motions synchronized with speech. The core innovation lies in the Recurrent Embedded Transformer (RET), which integrates Dynamic Embedding Regularization (DER) into a Vision Transformer (ViT) core architecture to explicitly model co-speech motion dynamics. This architecture enables joint spatial-temporal dependency modeling, thereby enhancing gesture naturalness and fidelity through coherent motion synthesis. To enhance model robustness, we incorporate the proposed DER strategy, which equips the model with dual capabilities of noise resistance and cross-domain generalization, thereby improving the naturalness and fluency of zero-shot motion generation for unseen speech inputs. To mitigate inherent limitations of autoregressive inference, including error accumulation and limited self-correction, we propose an iterative reconstruction inference (IRI) strategy. IRI refines motion sequences v
Related papers
- Transformer-S2A: Robust And Efficient Speech-to-animation (2021) · 8.35
- RapVerse: Coherent Vocals And Whole-body Motions Generations From Text (2024) · 0.00
- Audio Is All In One: Speech-driven Gesture Synthetics Using WavLM Pre-trained Model (2023) · 0.00
- DiffMotion: Speech-driven Gesture Synthesis Using Denoising Diffusion Model (2023) · 9.59
- FreeTalker: Controllable Speech And Text-driven Gesture Generation Based On Diffusion Models For Enhanced Speaker Naturalness (2024) · 9.59
- EmoGene: Audio-driven Emotional 3D Talking-head Generation (2024) · 2.26
- Diffusion-based Co-speech Gesture Generation Using Joint Text And Audio Representation (2023) · 10.07
- See The Speaker: Crafting High-resolution Talking Faces From Speech With Prior Guidance And Region Refinement (2025) · 0.00