MIO: A Foundation Model On Multimodal Tokens
2024 · Zekun Wang, King Zhu, Chunpu Xu, et al.
Abstract
In this paper, we introduce MIO, a novel foundation model built on multimodal tokens, capable of understanding and generating speech, text, images, and videos in an end-to-end, autoregressive manner. While the emergence of large language models (LLMs) and multimodal large language models (MM-LLMs) propels advancements in artificial general intelligence through their versatile capabilities, they still lack true any-to-any understanding and generation. Recently, the release of GPT-4o has showcased the remarkable potential of any-to-any LLMs for complex real-world tasks, enabling omnidirectional input and output across images, speech, and text. However, it is closed-source and does not support the generation of multimodal interleaved sequences. To address this gap, we present MIO, which is trained on a mixture of discrete tokens across four modalities using causal multimodal modeling. MIO undergoes a four-stage training process: (1) alignment pre-training, (2) interleaved pre-training, (3
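As a rough illustration of what "a mixture of discrete tokens across four modalities" with causal modeling can look like, the sketch below maps per-modality token ids into one shared vocabulary and builds next-token prediction pairs. It is not MIO's implementation: the codebook sizes, the offset scheme, and the helper names (`build_offsets`, `interleave`, `next_token_pairs`) are all assumptions made for this example, and the abstract does not specify the tokenizers or the vocabulary layout.

```python
from typing import Dict, List, Tuple

# Hypothetical codebook sizes per modality (assumed; not given in the abstract).
CODEBOOK_SIZES: Dict[str, int] = {"text": 1000, "image": 512, "speech": 256, "video": 512}


def build_offsets(sizes: Dict[str, int]) -> Tuple[Dict[str, int], int]:
    """Give each modality a disjoint id range inside one shared vocabulary."""
    offsets: Dict[str, int] = {}
    start = 0
    for name, size in sizes.items():
        offsets[name] = start
        start += size
    return offsets, start  # start is now the total vocabulary size


def interleave(segments: List[Tuple[str, List[int]]],
               offsets: Dict[str, int],
               sizes: Dict[str, int]) -> List[int]:
    """Flatten (modality, local ids) segments into one shared-vocabulary sequence."""
    sequence: List[int] = []
    for modality, ids in segments:
        for local_id in ids:
            if not 0 <= local_id < sizes[modality]:
                raise ValueError(f"{modality} id {local_id} is outside its codebook")
            sequence.append(offsets[modality] + local_id)
    return sequence


def next_token_pairs(sequence: List[int]) -> Tuple[List[int], List[int]]:
    """Causal objective: position t is trained to predict the token at t + 1."""
    return sequence[:-1], sequence[1:]


OFFSETS, VOCAB_SIZE = build_offsets(CODEBOOK_SIZES)
example = interleave(
    [("text", [5, 7]), ("image", [0, 3]), ("speech", [1])],
    OFFSETS,
    CODEBOOK_SIZES,
)
inputs, targets = next_token_pairs(example)
```

Because every modality lives in the same id space, one autoregressive model can both read and emit any of them; a cross-entropy loss over `targets` would then cover text, image, speech, and video tokens uniformly.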
Related papers
- M2-omni: Advancing Omni-MLLM For Comprehensive Modality Support With Competitive Performance (2025)
- VITA: Towards Open-source Interactive Omni Multimodal LLM (2024)
- AnyGPT: Unified Multimodal LLM With Discrete Sequence Modeling (2024)
- Training-free Multimodal Large Language Model Orchestration (2025)
- NExT-GPT: Any-to-any Multimodal LLM (2023)
- Capybara-omni: An Efficient Paradigm For Building Omni-modal Language Models (2025)
- LLMs Meet Multimodal Generation And Editing: A Survey (2024)
- A Review Of Multi-modal Large Language And Vision Models (2024)