Awesome Visual QA & Reasoning
Visual QA & Reasoning is one of the most active areas in Awesome Multimodal, with 3,868 papers in this collection, evaluated on datasets such as VQA, MMMU, and OK-VQA. A strong starting point is "VibeThinker-3B: Exploring the Frontier of Verifiable Reasoning in Small Language Models".
Datasets & benchmarks
Key papers
- VibeThinker-3B: Exploring the Frontier of Verifiable Reasoning in Small Language Models (2026) | Sen Xu et al. | 13.18
- STEP3-VL-10B Technical Report (2026) | Ailin Huang et al. | 13.07
- Urban Socio-Semantic Segmentation with Vision-Language Reasoning (2026) | Yu Wang et al. | 12.58
- Vision-DeepResearch: Incentivizing DeepResearch Capability in Multimodal Large Language Models (2026) | Wenxuan Huang et al. | 12.58
- Vision-DeepResearch Benchmark: Rethinking Visual and Textual Search for Multimodal Large Language Models (2026) | Yu Zeng et al. | 12.05
- Everything in Its Place: Benchmarking Spatial Intelligence of Text-to-Image Models (2026) | Zengbin Wang et al. | 11.87
- Task-Focused Memorization for Multimodal Agents (2026) | Tao Zou et al. | 11.25
- VisualClaw: A Real-Time, Personalized Agent for the Physical World (2026) | Haoqin Tu et al. | 11.22
- Think in Strokes, Not Pixels: Process-Driven Image Generation via Interleaved Reasoning (2026) | Lei Zhang et al. | 11.11
- Native Active Perception as Reasoning for Omni-Modal Understanding (2026) | Zhenghao Xing et al. | 11.01
- Zooming without Zooming: Region-to-Image Distillation for Fine-Grained Multimodal Perception (2026) | Lai Wei et al. | 10.72
- OmniVideo-100K: A Dataset for Audio-Visual Reasoning through Structured Scripts and Evidence Chains (2026) | Xinyue Cai et al. | 10.59
- MM-Zero: Self-Evolving Multi-Model Vision Language Models From Zero Data (2026) | Zongxia Li et al. | 10.41
- Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning (2026) | Chi-Pin Huang et al. | 10.38
- VideoDetective: Clue Hunting via both Extrinsic Query and Intrinsic Relevance for Long Video Understanding (2026) | Ruoliu Yang et al. | 10.24
- InternVL-U: Democratizing Unified Multimodal Models for Understanding, Reasoning, Generation and Editing (2026) | Changyao Tian et al. | 10.20
- Think3D: Thinking with Space for Spatial Reasoning (2026) | Zaibin Zhang et al. | 10.09
- SkinFlow: Efficient Information Transmission for Open Dermatological Diagnosis via Dynamic Visual Encoding and Staged RL (2026) | Lijun Liu et al. | 9.65
- Watch Before You Answer: Learning from Visually Grounded Post-Training (2026) | Yuxuan Zhang et al. | 9.65
- VideoAuto-R1: Video Auto Reasoning via Thinking Once, Answering Twice (2026) | Shuming Liu et al. | 9.48
- Efficient Multimodal Large Language Models: A Survey (2024) | Yizhang Jin et al. | 9.43
- Reinforcing Dual-Path Reasoning in Spatial Vision Language Models (2026) | Yatai Ji et al. | 9.18
- MedGemma Technical Report (2025) | Andrew Sellergren, Sahar Kazemzadeh, Tiam Jaroensri, et al. | 8.35
- Seeing Isn't Knowing: Do VLMs Know When Not to Answer Spatial Questions (and Why)? (2026) | Yue Zhang et al. | 8.18
- What Users Leave Unsaid: Under-Specified Queries Limit Vision-Language Models (2026) | Dasol Choi et al. | 7.81
- Chat-driven Text Generation and Interaction for Person Retrieval (2025) | Zequn Xie, Chuxin Wang, Sihang Cai, et al. | 7.72
- Thinking in Frames: How Visual Context and Test-Time Scaling Empower Video Reasoning (2026) | Chengzu Li et al. | 7.68
- Less Detail, Better Answers: Degradation-Driven Prompting for VQA (2026) | Haoxuan Han et al. | 7.56
- Loc3R-VLM: Language-based Localization and 3D Reasoning with Vision-Language Models (2026) | Kevin Qu et al. | 7.51
- VisPhyWorld: Probing Physical Reasoning via Code-Driven Video Reconstruction (2026) | Jiarong Liang et al. | 7.45
- MathCoder-VL: Bridging Vision and Code for Enhanced Multimodal Mathematical Reasoning (2025) | Ke Wang, Junting Pan, Linda Wei, et al. | 7.44
- SemIRNet: A Semantic Irony Recognition Network for Multimodal Sarcasm Detection (2025) | Jingxuan Zhou, Yuehao Wu, Yibo Zhang, et al. | 7.41
- LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning (2026) | Linquan Wu et al. | 7.40
- Rethinking Composed Image Retrieval Evaluation: A Fine-Grained Benchmark from Image Editing (2026) | Tingyu Song et al. | 7.40
- iVGR: Internalizing Visually Grounded Reasoning for MLLMs with Reinforcement Learning (2026) | Chang-Bin Zhang et al. | 7.31
- MVGGT: Multimodal Visual Geometry Grounded Transformer for Multiview 3D Referring Expression Segmentation (2026) | Changli Wu et al. | 7.24
- CLEAR: Unlocking Generative Potential for Degraded Image Understanding in Unified Multimodal Models (2026) | Xiangzhao Hao et al. | 7.04
- FlowInOne: Unifying Multimodal Generation as Image-in, Image-out Flow Matching (2026) | Junchao Yi et al. | 7.04
- SAGE: A Visual Language Model for Anomaly Detection via Fact Enhancement and Entropy-aware Alignment (2025) | Guoxin Zang, Xue Li, Donglin Di, et al. | 6.99
- Ref-Adv: Exploring MLLM Visual Reasoning in Referring Expression Tasks (2026) | Qihua Dong et al. | 6.93
- Forest Before Trees: Latent Superposition for Efficient Visual Reasoning (2026) | Yubo Wang et al. | 6.88
- Human-centered Interactive Learning via MLLMs for Text-to-image Person Re-identification (2025) | Yang Qin, Chao Chen, Zhihang Fu, et al. | 6.86
- Capabilities of GPT-5 on Multimodal Medical Reasoning (2025) | Shansong Wang, Mingzhe Hu, Qiang Li, et al. | 6.86
- DisasterM3: A Remote Sensing Vision-language Dataset for Disaster Damage Assessment and Response (2025) | Junjue Wang, Weihao Xuan, Heli Qi, et al. | 6.81
- ViDiC: Video Difference Captioning (2025) | Jiangtao Wu et al. | 6.79
- PokeGym: A Visually-Driven Long-Horizon Benchmark for Vision-Language Models (2026) | Ruizhi Zhang et al. | 6.61
- Patho-AgenticRAG: Towards Multimodal Agentic Retrieval-Augmented Generation for Pathology VLMs via Reinforcement Learning (2025) | Wenchuan Zhang et al. | 6.51
- Qwen3-VL Technical Report (2025) | Shuai Bai et al. | 6.40
- Reasoning in the Dark: Interleaved Vision-Text Reasoning in Latent Space (2025) | Chao Chen et al. | 6.34
- PathReasoner-R1: Instilling Structured Reasoning into Pathology Vision-Language Model via Knowledge-Guided Policy Optimization (2026) | Songhan Jiang et al. | 6.34
- Learning Situated Awareness in the Real World (2026) | Chuhan Li et al. | 6.25
- PDF-WuKong: A Large Multimodal Model for Efficient Long PDF Reading with End-to-end Sparse Sampling (2026) | Xudong Xie, Hao Yan, Liang Yin, et al. | 6.23
- CPPO: Contrastive Perception for Vision Language Policy Optimization (2026) | Ahmad Rezaei, Mohsen Gholami, Saeed Ranjbar Alvar, Kevin Cannons, Mohammad Asiful Hossain, Zhou Weimin, Shunbo Zhou, Yong Zhang, Mohammad Akbari | 6.19
- VQualA 2025 Challenge on Visual Quality Comparison for Large Multimodal Models: Methods and Results (2025) | Hanwei Zhu, Haoning Wu, Zicheng Zhang, et al. | 6.12
- Re-Align: Structured Reasoning-guided Alignment for In-Context Image Generation and Editing (2026) | Runze He et al. | 5.91
- Render-of-Thought: Rendering Textual Chain-of-Thought as Images for Visual Latent Reasoning (2026) | Yifan Wang, Shiyu Li, Peiming Li, et al. | 5.91
- Intra-Modal Neighbors Never Lie: Rectifying Inter-Modal Noisy Correspondence via Graph-Based Intra-Modal Reasoning (2026) | Yang Liu et al. | 5.88
- Improving the Reasoning of Multi-Image Grounding in MLLMs via Reinforcement Learning (2025) | Bob Zhang et al. | 5.87
- VisMem: Latent Vision Memory Unlocks Potential of Vision-Language Models (2025) | Xinlei Yu, Chengming Xu, Guibin Zhang, Zhangquan Chen, Yudong Zhang, Yongbo He, Peng-Tao Jiang, Jiangning Zhang, Xiaobin Hu, Shuicheng Yan | 5.83
- SeqVLM: Proposal-guided Multi-view Sequences Reasoning via VLM for Zero-shot 3D Visual Grounding (2025) | Jiawen Lin, Shiran Bian, Yihang Zhu, et al. | 5.73