OMHBench: Benchmarking Balanced And Grounded Omni-modal Multi-hop Reasoning
2026 · Seunghee Kim, Ingyu Bang, Seokgyu Jang, et al.
Abstract
arXiv:2508.16198v3
Multimodal Large Language Models (MLLMs) have increasingly supported omni-modal processing across text, vision, and speech. However, existing evaluation frameworks for such models suffer from critical limitations, including modality shortcuts and biased reasoning paths. To address these challenges, we propose OMHBench, a novel benchmark designed to rigorously evaluate omni-modal multi-hop reasoning. It consists of 6,144 questions with balanced reasoning paths that are jointly grounded across all three modalities. Extensive evaluation of 13 state-of-the-art models reveals that (1) a large performance gap exists between proprietary and open-source MLLMs and (2) even proprietary models exhibit high sensitivity to reasoning path variations, resulting in asymmetric omni-modal grounding. Notably, models struggle when processing the speech modality, underscoring the need for balanced, multi-hop evaluation of omni-modal intelligence.
Related papers
- M2-omni: Advancing Omni-MLLM For Comprehensive Modality Support With Competitive Performance (2025)
- Capybara-OMNI: An Efficient Paradigm For Building Omni-modal Language Models (2025)
- MMSU: A Massive Multi-task Spoken Language Understanding And Reasoning Benchmark (2025)
- Towards Holistic Evaluation Of Large Audio-language Models: A Comprehensive Survey (2026)
- WorldSense: Evaluating Real-world Omnimodal Understanding For Multimodal LLMs (2025)
- AudioBench: A Universal Benchmark For Audio Large Language Models (2024)
- MCIF: Multimodal Crosslingual Instruction-following Benchmark From Scientific Talks (2025)
- VCB Bench: An Evaluation Benchmark For Audio-grounded Large Language Model Conversational Agents (2025)