OmniCustom: Sync Audio-Video Customization via Joint Audio-Video Generation Model
2026 · Maomao Li, Zhen Li, Kaipeng Zhang, et al.
Abstract
Existing mainstream video customization methods focus on generating identity-consistent videos based on given reference images and textual prompts. Benefiting from the rapid advancement of joint audio-video generation, this paper proposes a more compelling new task: sync audio-video customization, which aims to synchronously customize both video identity and audio timbre. Specifically, given a reference image \(I^{r}\) and a reference audio \(A^{r}\), this novel task requires generating videos that maintain the identity of the reference image while imitating the timbre of the reference audio, with spoken content freely specifiable through user-provided textual prompts. To this end, we propose OmniCustom, a powerful DiT-based audio-video customization framework that can synthesize a video following reference image identity, audio timbre, and text prompts all at once in a zero-shot manner. Our framework is built on three key contributions. First, identity and audio timbre control are