Audiox: A Unified Framework For Anything-to-audio Generation
2025 Β· Zeyue Tian, Zhaoyang Liu, Yizhu Jin, et al.
Abstract
Audio and music generation based on flexible multimodal control signals is a widely applicable topic, with the following key challenges: 1) a unified multimodal modeling framework, and 2) large-scale, high-quality training data. As such, we propose AudioX, a unified framework for anything-to-audio generation that integrates varied multimodal conditions (i.e., text, video, and audio signals) in this work. The core design in this framework is a Multimodal Adaptive Fusion module, which enables the effective fusion of diverse multimodal inputs, enhancing cross-modal alignment and improving overall generation quality. To train this unified model, we construct a large-scale, high-quality dataset, IF-caps, comprising over 7 million samples curated through a structured data annotation pipeline. This dataset provides comprehensive supervision for multimodal-conditioned audio generation. We benchmark AudioX against state-of-the-art methods across a wide range of tasks, finding that our model ach
Authors
(none)
Tags
Stats
Related papers
- Audio-omni: Extending Multi-modal Understanding To Versatile Audio Generation And Editing (2026)0.00
- Deepaudio-v1:towards Multi-modal Multi-stage End-to-end Video To Speech And Audio Generation (2025)0.00
- Speechx: Neural Codec Language Model As A Versatile Speech Transformer (2023)11.85
- Uniaudio: An Audio Foundation Model Toward Universal Audio Generation (2023)5.56
- Mmaudio: Taming Multimodal Joint Training For High-quality Video-to-audio Synthesis (2024)0.00
- Apollo: Unified Multi-task Audio-video Joint Generation (2026)0.00
- Audiogen: Textually Guided Audio Generation (2022)0.00
- Hunyuanvideo-foley: Multimodal Diffusion With Representation Alignment For High-fidelity Foley Audio Generation (2025)0.00