WorldSense: Evaluating Real-world Omnimodal Understanding For Multimodal LLMs
2025 · Jack Hong, Shilin Yan, Jiayin Cai, et al.
Abstract
We introduce WorldSense, the first benchmark to assess multi-modal video understanding that simultaneously encompasses visual, audio, and text inputs. In contrast to existing benchmarks, WorldSense has several features: (i) collaboration of omni-modality: we design the evaluation tasks to feature a strong coupling of audio and video, requiring models to effectively utilize the synergistic perception of omni-modality; (ii) diversity of videos and tasks: WorldSense encompasses a diverse collection of 1,662 audio-visual synchronised videos, systematically categorized into 8 primary domains and 67 fine-grained subcategories to cover broad scenarios, and 3,172 multi-choice QA pairs across 26 distinct tasks to enable comprehensive evaluation; (iii) high-quality annotations: all the QA pairs are manually labeled by 80 expert annotators with multiple rounds of correction to ensure quality. Based on WorldSense, we extensively evaluate various state-of-the-art models. The exper
Related papers
- Omhbench: Benchmarking Balanced And Grounded Omni-modal Multi-hop Reasoning (2026) 0.00
- Omni-captioner: Data Pipeline, Models, And Benchmark For Omni Detailed Perception (2025) 0.00
- WAVE: Learning Unified & Versatile Audio-visual Embeddings With Multimodal LLM (2025) 0.00
- VALOR: Vision-audio-language Omni-perception Pretraining Model And Dataset (2023) 10.61
- VAST: A Vision-audio-subtitle-text Omni-modality Foundation Model And Dataset (2023) 14.55
- Audiobench: A Universal Benchmark For Audio Large Language Models (2024) 10.21
- MMSU: A Massive Multi-task Spoken Language Understanding And Reasoning Benchmark (2025) 2.29
- Capybara-omni: An Efficient Paradigm For Building Omni-modal Language Models (2025) 0.00