TalkVerse: Democratizing Minute-long Audio-driven Video Generation
2025 · Zhenzhi Wang, Jian Wang, Ke Ma, et al.
Abstract
We introduce TalkVerse, a large-scale, open corpus for single-person, audio-driven talking video generation designed to enable fair, reproducible comparison across methods. While current state-of-the-art systems rely on closed data or compute-heavy models, TalkVerse offers 2.3 million high-resolution (720p/1080p) audio-video synchronized clips totaling 6.3k hours. These are curated from over 60k hours of video via a transparent pipeline that includes scene-cut detection, aesthetic assessment, strict audio-visual synchronization checks, and comprehensive annotations including 2D skeletons and structured visual/audio-style captions. Leveraging TalkVerse, we present a reproducible 5B DiT baseline built on Wan2.2-5B. By utilizing a video VAE with a high downsampling ratio and a sliding window mechanism with motion-frame context, our model achieves minute-long generation with low drift. It delivers comparable lip-sync and visual quality to the 14B Wan-S2V model but with 10\(\times\) lower i
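The sliding-window idea mentioned in the abstract can be illustrated with a minimal scheduling sketch. This is not the authors' implementation: the function name `plan_windows`, the parameters `window` and `motion_frames`, and the example sizes are all hypothetical, since the abstract does not give them. The sketch only shows how a long clip could be split into windows where each window after the first reuses the last few frames of the previous one as context.

```python
from typing import List, Tuple


def plan_windows(total_frames: int, window: int, motion_frames: int) -> List[Tuple[int, int, int]]:
    """Plan sliding windows over a long clip (hypothetical helper).

    Returns one (context_start, gen_start, gen_end) tuple per window,
    with gen_end exclusive. Frames in [context_start, gen_start) are
    already-generated motion frames reused as context; frames in
    [gen_start, gen_end) are newly generated in that window.
    """
    if window <= 0 or not 0 <= motion_frames < window:
        raise ValueError("need window > 0 and 0 <= motion_frames < window")

    plans = []
    gen_start = 0
    while gen_start < total_frames:
        # The first window has no previous frames to condition on.
        context = 0 if gen_start == 0 else motion_frames
        new_frames = window - context
        gen_end = min(gen_start + new_frames, total_frames)
        plans.append((gen_start - context, gen_start, gen_end))
        gen_start = gen_end
    return plans


if __name__ == "__main__":
    # Assumed toy sizes: 10 frames, 4-frame windows, 1 motion frame of context.
    for ctx_start, start, end in plan_windows(10, 4, 1):
        print(f"context [{ctx_start}, {start}) -> generate [{start}, {end})")
```

Because every window has the same fixed size, the memory needed per step stays constant no matter how long the output is, which is the general appeal of this kind of scheme for minute-long generation.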
Related papers
- Edityourself: Audio-driven Generation And Manipulation Of Talking Head Videos With Diffusion Transformers (2026)
- More Than Words: In-the-wild Visually-driven Prosody For Text-to-speech (2021)
- Mtavg-bench: A Comprehensive Benchmark For Evaluating Multi-talker Dialogue-centric Audio-video Generation (2026)
- Text-to-video: A Two-stage Framework For Zero-shot Identity-agnostic Talking-head Generation (2023)
- Dreamfoley: Scalable VLMs For High-fidelity Video-to-audio Generation (2025)
- Semantically Consistent Video-to-audio Generation Using Multimodal Language Large Model (2024)
- Taming Text-to-sounding Video Generation Via Advanced Modality Condition And Interaction (2025)
- Deepsound-v1: Start To Think Step-by-step In The Audio Generation From Videos (2025)