REST: Diffusion-based Real-time End-to-end Streaming Talking Head Generation via ID-Context Caching and Asynchronous Streaming Distillation
2025 · Haotian Wang, Yuzhe Weng, Jun Du, et al.
Abstract
Diffusion models have significantly advanced the field of talking head generation (THG). However, slow inference speeds and prevalent non-autoregressive paradigms severely constrain the application of diffusion-based THG models. In this study, we propose REST, a pioneering diffusion-based, real-time, end-to-end streaming audio-driven talking head generation framework. To support real-time end-to-end generation, a compact video latent space is first learned through a spatiotemporal variational autoencoder with a high compression ratio. Additionally, to enable semi-autoregressive streaming within the compact video latent space, we introduce an ID-Context Cache mechanism, which integrates ID-Sink and Context-Cache principles into key-value caching for maintaining identity consistency and temporal coherence during long-term streaming generation. Furthermore, an Asynchronous Streaming Distillation (ASD) strategy is proposed to mitigate error accumulation and enhance temporal consistency in
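
The ID-Context Cache described in the abstract can be pictured with a toy sketch. The code below is not the authors' implementation: the class name `IDContextCache`, the `max_context_chunks` parameter, and the use of plain strings in place of real key-value tensors are all assumptions made for illustration. It only shows the general idea of keeping identity entries permanently (the "sink") while rolling a bounded window of recent context through a key-value cache during chunk-by-chunk streaming.

```python
from collections import deque


class IDContextCache:
    """Toy cache: permanent identity 'sink' entries plus a rolling context window.

    Hypothetical illustration only. In a real model each entry would be a
    (key, value) tensor pair from an attention layer, not a string.
    """

    def __init__(self, id_entries, max_context_chunks):
        # Identity entries are never evicted (the "ID-Sink" idea).
        self.id_entries = list(id_entries)
        # Only the most recent chunks are kept (the "Context-Cache" idea);
        # deque(maxlen=...) drops the oldest chunk automatically.
        self.context = deque(maxlen=max_context_chunks)

    def append_chunk(self, chunk_entries):
        """Store the entries produced for a newly generated chunk."""
        self.context.append(list(chunk_entries))

    def visible_entries(self):
        """Entries the next chunk may attend to: identity first, then recent context."""
        visible = list(self.id_entries)
        for chunk in self.context:
            visible.extend(chunk)
        return visible


def stream(cache, chunks):
    """Semi-autoregressive loop: each chunk sees the cache, then joins it."""
    seen_per_chunk = []
    for chunk in chunks:
        seen_per_chunk.append(cache.visible_entries())
        cache.append_chunk(chunk)
    return seen_per_chunk
```

In this sketch the cache size stays bounded no matter how long the stream runs, while the identity entries remain visible to every chunk, which is the property the abstract attributes to the mechanism.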
Related papers
- EmotiveTalk: Expressive Talking Head Generation Through Audio Information Decoupling And Emotional Video Diffusion (2024) 0.00
- SoulX-FlashTalk: Real-time Infinite Streaming Of Audio-driven Avatars Via Self-correcting Bidirectional Distillation (2025) 0.00
- DiffusionTalker: Efficient And Compact Speech-driven 3D Talking Head Via Personalizer-guided Distillation (2025) 5.05
- EditYourself: Audio-driven Generation And Manipulation Of Talking Head Videos With Diffusion Transformers (2026) 0.00
- SyncDiff: Diffusion-based Talking Head Synthesis With Bottlenecked Temporal Visual Prior For Improved Synchronization (2025) 4.52
- Real-time Streamable Generative Speech Restoration With Flow Matching (2025) 0.00
- FaceDiffuser: Speech-driven 3D Facial Animation Synthesis Using Diffusion (2023) 13.79
- DiffSHEG: A Diffusion-based Approach For Real-time Speech-driven Holistic 3D Expression And Gesture Generation (2024) 0.00