SoulX-FlashTalk: Real-time Infinite Streaming Of Audio-driven Avatars Via Self-correcting Bidirectional Distillation
2025 · Le Shen, Qian Qiao, Tan Yu, et al.
Abstract
Deploying massive diffusion models for real-time, infinite-duration, audio-driven avatar generation presents a significant engineering challenge, primarily due to the conflict between computational load and strict latency constraints. Existing approaches often compromise visual fidelity by enforcing strictly unidirectional attention mechanisms or reducing model capacity. To address this problem, we introduce SoulX-FlashTalk, a 14B-parameter framework optimized for high-fidelity real-time streaming. Diverging from conventional unidirectional paradigms, we use a Self-correcting Bidirectional Distillation strategy that retains bidirectional attention within video chunks. This design preserves critical spatiotemporal correlations, significantly enhancing motion coherence and visual detail. To ensure stability during infinite generation, we incorporate a Multi-step Retrospective Self-Correction Mechanism, enabling the model to autonomously recover from accum
Related papers
- REST: Diffusion-based Real-time End-to-end Streaming Talking Head Generation Via ID-context Caching And Asynchronous Streaming Distillation (2025) 0.00
- FADA: Fast Diffusion Avatar Synthesis With Mixed-supervised Multi-CFG Distillation (2024) 2.26
- DreamFoley: Scalable VLMs For High-fidelity Video-to-audio Generation (2025) 0.00
- DiffusionTalker: Efficient And Compact Speech-driven 3D Talking Head Via Personalizer-guided Distillation (2025) 5.05
- FLOAT: Generative Motion Latent Flow Matching For Audio-driven Talking Portrait (2024) 0.00
- FlashAudio: Rectified Flows For Fast And High-fidelity Text-to-audio Generation (2024) 5.13
- EditYourself: Audio-driven Generation And Manipulation Of Talking Head Videos With Diffusion Transformers (2026) 0.00
- EmotiveTalk: Expressive Talking Head Generation Through Audio Information Decoupling And Emotional Video Diffusion (2024) 0.00