Apollo: Unified Multi-task Audio-video Joint Generation
2026 · Jun Wang, Chunyu Qiang, Yuxin Guo, et al.
Abstract
Audio-video joint generation has progressed rapidly, yet substantial challenges remain. Non-commercial approaches still suffer from audio-visual asynchrony, poor lip-speech alignment, and unimodal degradation, which stem from weak audio-visual correspondence modeling, limited generalization, and scarce high-quality dense-caption data. To address these issues, we introduce Apollo and examine three axes: model architecture, training strategy, and data curation. Architecturally, we adopt a single-tower design with unified DiT blocks and an Omni-Full Attention mechanism, achieving tight audio-visual alignment and strong scalability. For training, we adopt a progressive multitask regime (random modality masking to joint optimization across tasks) and a multistage curriculum, which yields robust representations, strengthens A-V aligned world knowledge, and prevents unimodal collapse. For datasets, we present the first large-scale audio-video dataset with dense captions, and in
Related papers
- MMAudio: Taming Multimodal Joint Training For High-quality Video-to-audio Synthesis (2024)
- JavisDiT++: Unified Modeling And Optimization For Joint Audio-video Generation (2026)
- DeepAudio-V1: Towards Multi-modal Multi-stage End-to-end Video To Speech And Audio Generation (2025)
- Taming Text-to-sounding Video Generation Via Advanced Modality Condition And Interaction (2025)
- AudioX: A Unified Framework For Anything-to-audio Generation (2025)
- Video-to-audio Generation With Hidden Alignment (2024)
- HunyuanVideo-Foley: Multimodal Diffusion With Representation Alignment For High-fidelity Foley Audio Generation (2025)
- Audio-Omni: Extending Multi-modal Understanding To Versatile Audio Generation And Editing (2026)