Awesome Benchmarks
Benchmarks is one of the most active areas in Awesome Multimodal, with 4,489 papers in this collection, evaluated on datasets such as LIBERO, POPE, and MMMU. A strong starting point is "DreamX-World 1.0: A General-Purpose Interactive World Model".
Datasets & benchmarks
Key papers
- DreamX-World 1.0: A General-Purpose Interactive World Model (2026) | DreamX Team et al. | 15.33
- InterleaveThinker: Reinforcing Agentic Interleaved Generation (2026) | Dian Zheng et al. | 14.38
- VibeThinker-3B: Exploring the Frontier of Verifiable Reasoning in Small Language Models (2026) | Sen Xu et al. | 14.06
- LabVLA: Grounding Vision-Language-Action Models in Scientific Laboratories (2026) | Baochang Ren et al. | 13.59
- STEP3-VL-10B Technical Report (2026) | Ailin Huang et al. | 13.07
- Urban Socio-Semantic Segmentation with Vision-Language Reasoning (2026) | Yu Wang et al. | 12.58
- Orchestra-o1: Omnimodal Agent Orchestration (2026) | Fan Zhang et al. | 12.21
- VidVec: Unlocking Video MLLM Embeddings for Video-Text Retrieval (2026) | Issar Tzachor et al. | 12.16
- Vision-DeepResearch Benchmark: Rethinking Visual and Textual Search for Multimodal Large Language Models (2026) | Yu Zeng et al. | 12.05
- Geometric Action Model for Robot Policy Learning (2026) | Jisang Han et al. | 11.96
- Everything in Its Place: Benchmarking Spatial Intelligence of Text-to-Image Models (2026) | Zengbin Wang et al. | 11.87
- FORGE: Fine-grained Multimodal Evaluation for Manufacturing Scenarios (2026) | Xiangru Jian et al. | 11.70
- M$^3$Eval: Multi-Modal Memory Evaluation through Cognitively-Grounded Video Tasks (2026) | Jie Huang et al. | 11.47
- Task-Focused Memorization for Multimodal Agents (2026) | Tao Zou et al. | 11.25
- VisualClaw: A Real-Time, Personalized Agent for the Physical World (2026) | Haoqin Tu et al. | 11.22
- Think in Strokes, Not Pixels: Process-Driven Image Generation via Interleaved Reasoning (2026) | Lei Zhang et al. | 11.11
- Qwen-RobotWorld Technical Report: Unifying Embodied World Modeling through Language-Conditioned Video Generation (2026) | Jie Zhang et al. | 11.03
- SpecEyes: Accelerating Agentic Multimodal LLMs via Speculative Perception and Planning (2026) | Haoyu Huang et al. | 10.74
- Zooming without Zooming: Region-to-Image Distillation for Fine-Grained Multimodal Perception (2026) | Lai Wei et al. | 10.72
- TwinBrainVLA: Unleashing the Potential of Generalist VLMs for Embodied Tasks via Asymmetric Mixture-of-Transformers (2026) | Bin Yu et al. | 10.59
- OmniVideo-100K: A Dataset for Audio-Visual Reasoning through Structured Scripts and Evidence Chains (2026) | Xinyue Cai et al. | 10.59
- The Script is All You Need: An Agentic Framework for Long-Horizon Dialogue-to-Cinematic Video Generation (2026) | Chenyu Mu et al. | 10.41
- MM-Zero: Self-Evolving Multi-Model Vision Language Models From Zero Data (2026) | Zongxia Li et al. | 10.41
- Cross-Modal Retrieval for Motion and Text via DropTriple Loss (2023) | Sheng Yan et al. | 10.35
- LLaVE: Large Language and Vision Embedding Models with Hardness-Weighted Contrastive Learning (2025) | Zhibin Lan et al. | 10.31
- VideoDetective: Clue Hunting via both Extrinsic Query and Intrinsic Relevance for Long Video Understanding (2026) | Ruoliu Yang et al. | 10.24
- UniCorn: Towards Self-Improving Unified Multimodal Models through Self-Generated Supervision (2026) | Ruiyan Han et al. | 10.17
- Text-Vision Co-Instructed Image Editing (2026) | Chenxi Xie et al. | 9.90
- CODA-BENCH: Can Code Agents Handle Data-Intensive Tasks? (2026) | Yuxin Zhang et al. | 9.73
- VisGym: Diverse, Customizable, Scalable Environments for Multimodal Agents (2026) | Zirui Wang et al. | 9.71
- SkinFlow: Efficient Information Transmission for Open Dermatological Diagnosis via Dynamic Visual Encoding and Staged RL (2026) | Lijun Liu et al. | 9.65
- Watch Before You Answer: Learning from Visually Grounded Post-Training (2026) | Yuxuan Zhang et al. | 9.65
- GutenOCR: A Grounded Vision-Language Front-End for Documents (2026) | Hunter Heidenreich et al. | 9.54
- VideoAuto-R1: Video Auto Reasoning via Thinking Once, Answering Twice (2026) | Shuming Liu et al. | 9.48
- Omnivec2 -- A Novel Transformer Based Network For Large Scale Multimodal And Multitask Learning (2025) | Siddharth Srivastava, Gaurav Sharma | 9.46
- UniDDT: Unifying Multimodal Understanding and Generation with Decoupled Diffusion Transformer (2026) | Shuai Wang et al. | 9.34
- Pushupbench: Your VLM Is Not Good At Counting Pushups (2026) | Shengzhi Li, Jiarun Chen, Karun Sharma, et al. | 9.22
- RepWAM: World Action Modeling with Representation Visual-Action Tokenizers (2026) | Junke Wang et al. | 9.19
- PhoneHarness: Harnessing Phone-Use Agents through Mixed GUI, CLI, and Tool Actions (2026) | Chenxin Li et al. | 9.11
- VINO: A Unified Visual Generator with Interleaved OmniModal Context (2026) | Junyi Chen et al. | 9.10
- X-SAM: From Segment Anything to Any Segmentation (2025) | Hao Wang et al. | 9.02
- Multimodal Fake News Detection: MFND Dataset And Shallow-deep Multitask Learning (2025) | Ye Zhu, Yunan Wang, Zitong Yu | 9.01
- DualToken: Towards Unifying Visual Understanding and Generation with Dual Visual Vocabularies (2025) | Wei Song et al. | 9.00
- Go To Zero: Towards Zero-shot Motion Generation With Million-scale Data (2025) | Ke Fan, Shunlin Lu, Minyue Dai, et al. | 8.97
- OS-Symphony: A Holistic Framework for Robust and Generalist Computer-Using Agent (2026) | Bowen Yang et al. | 8.96
- A Safety Report on GPT-5.2, Gemini 3 Pro, Qwen3-VL, Grok 4.1 Fast, Nano Banana Pro, and Seedream 4.5 (2026) | Xingjun Ma et al. | 8.81
- ClinHallu: A Benchmark for Diagnosing Stage-Wise Hallucinations in Medical MLLM Reasoning (2026) | Sicheng Yang et al. | 8.47
- ViTextVQA: A Large-Scale Visual Question Answering Dataset and a Novel Multimodal Feature Fusion Method for Vietnamese Text Comprehension in Images (2024) | Quan Van Nguyen et al. | 8.37
- Seeing Isn't Knowing: Do VLMs Know When Not to Answer Spatial Questions (and Why)? (2026) | Yue Zhang et al. | 8.18
- RANKVIDEO: Reasoning Reranking for Text-to-Video Retrieval (2026) | Tyler Skow et al. | 8.11
- Know "No" Better: A Data-Driven Approach for Enhancing Negation Awareness in CLIP (2025) | Junsung Park et al. | 8.07
- MVEB: Massive Video Embedding Benchmark (2026) | Adnan El Assadi et al. | 7.98
- DecAlign: Hierarchical Cross-Modal Alignment for Decoupled Multimodal Representation Learning (2025) | Chengxuan Qian et al. | 7.82
- What Users Leave Unsaid: Under-Specified Queries Limit Vision-Language Models (2026) | Dasol Choi et al. | 7.81
- Dream-VL & Dream-VLA: Open Vision-Language and Vision-Language-Action Models with Diffusion Language Model Backbone (2025) | Jiacheng Ye et al. | 7.64
- Is CLIP ideal? No. Can we fix it? Yes! (2025) | Raphi Kang et al. | 7.55
- Loc3R-VLM: Language-based Localization and 3D Reasoning with Vision-Language Models (2026) | Kevin Qu et al. | 7.51
- VisPhyWorld: Probing Physical Reasoning via Code-Driven Video Reconstruction (2026) | Jiarong Liang et al. | 7.45
- BEAR: Benchmarking and Enhancing Multimodal Language Models for Atomic Embodied Capabilities (2025) | Yu Qi et al. | 7.41
- Semirnet: A Semantic Irony Recognition Network For Multimodal Sarcasm Detection (2025) | Jingxuan Zhou, Yuehao Wu, Yibo Zhang, et al. | 7.41