Audio-Omni: Extending Multi-modal Understanding To Versatile Audio Generation And Editing
2026 · Zeyue Tian, Binxin Yang, Zhaoyang Liu, et al.
Abstract
arXiv:2604.10708v2
Recent progress in multimodal models has spurred rapid advances in audio understanding, generation, and editing. However, these capabilities are typically addressed by specialized models, leaving the development of a truly unified framework that can seamlessly integrate all three tasks underexplored. While some pioneering works have explored unifying audio understanding and generation, they often remain confined to specific domains. To address this, we introduce Audio-Omni, the first end-to-end framework to unify generation and editing across general sound, music, and speech domains, with integrated multi-modal understanding capabilities. Our architecture synergizes a frozen Multimodal Large Language Model for high-level reasoning with a trainable Diffusion Transformer for high-fidelity synthesis. To overcome the critical data scarcity in audio editing, we construct AudioEdit, a new large-scale dataset comprising over one milli
Related papers
- AudioX: A Unified Framework For Anything-to-audio Generation (2025) 0.00
- UniAudio: An Audio Foundation Model Toward Universal Audio Generation (2023) 5.56
- Au-m-ol: A Unified Model For Medical Audio And Language Understanding (2026) 0.00
- DeepAudio-V1: Towards Multi-modal Multi-stage End-to-end Video To Speech And Audio Generation (2025) 0.00
- Audio-Agent: Leveraging LLMs For Audio Generation, Editing And Composition (2024) 0.00
- AudioLDM 2: Learning Holistic Audio Generation With Self-supervised Pretraining (2023) 0.00
- DreamAudio: Customized Text-to-audio Generation With Diffusion Models (2026) 0.00
- 3mdit: Unified Tri-modal Diffusion Transformer For Text-driven Synchronized Audio-video Generation (2025) 0.00