M2-omni: Advancing Omni-MLLM for Comprehensive Modality Support with Competitive Performance
2025 · Qingpei Guo, Kaiyou Song, Zipeng Feng, et al.
Abstract
We present M2-omni, a cutting-edge, open-source omni-MLLM that achieves performance competitive with GPT-4o. M2-omni employs a unified multimodal sequence modeling framework, which enables Large Language Models (LLMs) to acquire comprehensive cross-modal understanding and generation capabilities. Specifically, M2-omni can process arbitrary combinations of audio, video, image, and text as input and generate multimodal sequences that interleave audio, image, or text outputs, thereby enabling an advanced, interactive real-time experience. Training such an omni-MLLM is complicated by significant disparities in data quantity and convergence rates across modalities. To address these challenges, we propose a step balance strategy during pre-training to handle the quantity disparities in modality-specific data. Additionally, a dynamically adaptive balance strategy is introduced during the instruction tuning stage to synchronize the modality-wise training progress, ensuring o
Related papers
- Capybara-omni: An Efficient Paradigm For Building Omni-modal Language Models (2025)
- MIO: A Foundation Model On Multimodal Tokens (2024)
- Omni-c: Compressing Heterogeneous Modalities Into A Single Dense Encoder (2026)
- Ming-flash-omni: A Sparse, Unified Architecture For Multimodal Perception And Generation (2025)
- Next-gpt: Any-to-any Multimodal LLM (2023)
- Omni-avsr: Towards Unified Multimodal Speech Recognition With Large Language Models (2025)
- Llama-omni: Seamless Speech Interaction With Large Language Models (2024)
- Omhbench: Benchmarking Balanced And Grounded Omni-modal Multi-hop Reasoning (2026)