OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows
2024 · Shufan Li, Konstantinos Kallidromitis, Akash Gokul, et al.
Abstract
We introduce OmniFlow, a novel generative model designed for any-to-any generation tasks such as text-to-image, text-to-audio, and audio-to-image synthesis. OmniFlow advances the rectified flow (RF) framework used in text-to-image models to handle the joint distribution of multiple modalities. It outperforms previous any-to-any models on a wide range of tasks, such as text-to-image and text-to-audio synthesis. Our work offers three key contributions: First, we extend RF to a multi-modal setting and introduce a novel guidance mechanism, enabling users to flexibly control the alignment between different modalities in the generated outputs. Second, we propose a novel architecture that extends the text-to-image MMDiT architecture of Stable Diffusion 3 and enables audio and text generation. The extended modules can be efficiently pretrained individually and merged with the vanilla text-to-image MMDiT for fine-tuning. Lastly, we conduct a comprehensive study on the design choices of rectifie
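As a rough illustration of the rectified-flow idea the abstract builds on, the following is a minimal pure-Python sketch. It is not OmniFlow's implementation: the function names (`rf_interpolate`, `rf_velocity_target`, `make_training_pairs`), the toy vectors, and the choice to draw one independent timestep per modality are all assumptions made for illustration. It shows only the standard rectified-flow ingredients, a straight-line path between a data sample and Gaussian noise and the constant velocity along that path, applied separately to each modality.

```python
import random


def rf_interpolate(x0, noise, t):
    """Point on the straight line from data x0 (t=0) to noise (t=1)."""
    return [(1.0 - t) * a + t * n for a, n in zip(x0, noise)]


def rf_velocity_target(x0, noise):
    """Velocity of that straight line, d x_t / dt = noise - x0."""
    return [n - a for a, n in zip(x0, noise)]


def make_training_pairs(samples, rng):
    """Build one (t, x_t, velocity) triple per modality.

    `samples` maps a modality name to a flat list of floats.
    Assumption for illustration: each modality gets its own timestep.
    """
    pairs = {}
    for name, x0 in samples.items():
        t = rng.random()  # timestep in [0, 1)
        noise = [rng.gauss(0.0, 1.0) for _ in x0]
        x_t = rf_interpolate(x0, noise, t)
        v = rf_velocity_target(x0, noise)
        pairs[name] = (t, x_t, v)
    return pairs


# Hypothetical toy "latents" standing in for image, audio, and text features.
rng = random.Random(0)
samples = {
    "image": [0.0, 0.5, 1.0, 1.5],
    "audio": [1.0, 1.0, 1.0],
    "text": [2.0, -2.0],
}
pairs = make_training_pairs(samples, rng)
```

In a trained model, a network would be fit to predict each velocity from the noised inputs. Stepping a noised sample back along the true velocity by `t` recovers the original data, which is a quick sanity check on the construction.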
Related papers
- Next-omni: Towards Any-to-any Omnimodal Foundation Models With Discrete Flow Matching (2025) · 0.00
- Flowbind: Efficient Any-to-any Generation With Bidirectional Flows (2025) · 0.00
- Voiceflow: Efficient Text-to-speech With Rectified Flow Matching (2023) · 0.00
- Reflow-tts: A Rectified Flow Model For High-fidelity Text-to-speech (2023) · 7.50
- Flashaudio: Rectified Flows For Fast And High-fidelity Text-to-audio Generation (2024) · 5.13
- Flowavenet: A Generative Flow For Raw Audio (2018) · 0.00
- Generative Pre-training For Speech With Flow Matching (2023) · 0.00
- Audio-omni: Extending Multi-modal Understanding To Versatile Audio Generation And Editing (2026) · 0.00