Awesome Vision-Language Models
Vision-Language Models is one of the most active areas in Awesome Multimodal, with 8,639 papers in this collection, evaluated on datasets such as LIBERO, MS COCO, and VQA. A strong starting point is "InterleaveThinker: Reinforcing Agentic Interleaved Generation".
Key papers
- InterleaveThinker: Reinforcing Agentic Interleaved Generation (2026) | Dian Zheng et al. | 14.38
- VibeThinker-3B: Exploring the Frontier of Verifiable Reasoning in Small Language Models (2026) | Sen Xu et al. | 14.06
- LabVLA: Grounding Vision-Language-Action Models in Scientific Laboratories (2026) | Baochang Ren et al. | 13.59
- STEP3-VL-10B Technical Report (2026) | Ailin Huang et al. | 13.07
- Retrieval-Augmented Generation for Natural Language Processing: A Survey (2026) | Shangyu Wu et al. | 12.85
- Representation Forcing for Bottleneck-Free Unified Multimodal Models (2026) | Yuqing Wang et al. | 12.78
- Urban Socio-Semantic Segmentation with Vision-Language Reasoning (2026) | Yu Wang et al. | 12.58
- Orchestra-o1: Omnimodal Agent Orchestration (2026) | Fan Zhang et al. | 12.21
- Vision-DeepResearch Benchmark: Rethinking Visual and Textual Search for Multimodal Large Language Models (2026) | Yu Zeng et al. | 12.05
- Geometric Action Model for Robot Policy Learning (2026) | Jisang Han et al. | 11.96
- FORGE: Fine-grained Multimodal Evaluation for Manufacturing Scenarios (2026) | Xiangru Jian et al. | 11.70
- Who Should Lead Decoding Now? Tracking Reliable Trajectories for Ensembling Masked Diffusion Language Models (2026) | Heecheol Yun et al. | 11.70
- VisionZip: Longer is Better but Not Necessary in Vision Language Models (2024) | Senqiao Yang et al. | 11.27
- VisualClaw: A Real-Time, Personalized Agent for the Physical World (2026) | Haoqin Tu et al. | 11.22
- Think in Strokes, Not Pixels: Process-Driven Image Generation via Interleaved Reasoning (2026) | Lei Zhang et al. | 11.11
- Qwen-RobotWorld Technical Report: Unifying Embodied World Modeling through Language-Conditioned Video Generation (2026) | Jie Zhang et al. | 11.03
- SpecEyes: Accelerating Agentic Multimodal LLMs via Speculative Perception and Planning (2026) | Haoyu Huang et al. | 10.74
- Zooming without Zooming: Region-to-Image Distillation for Fine-Grained Multimodal Perception (2026) | Lai Wei et al. | 10.72
- NextFlow: Unified Sequential Modeling Activates Multimodal Understanding and Generation (2026) | Huichao Zhang et al. | 10.66
- FASTER: Rethinking Real-Time Flow VLAs (2026) | Yuxiang Lu et al. | 10.64
- MM-Zero: Self-Evolving Multi-Model Vision Language Models From Zero Data (2026) | Zongxia Li et al. | 10.41
- LLaVE: Large Language and Vision Embedding Models with Hardness-Weighted Contrastive Learning (2025) | Zhibin Lan et al. | 10.31
- BadWorld: Adversarial Attacks on World Models (2026) | Linghui Shen et al. | 10.21
- InternVL-U: Democratizing Unified Multimodal Models for Understanding, Reasoning, Generation and Editing (2026) | Changyao Tian et al. | 10.20
- Think3D: Thinking with Space for Spatial Reasoning (2026) | Zaibin Zhang et al. | 10.09
- Uni-ViGU: Towards Unified Video Generation and Understanding via A Diffusion-Based Video Generator (2026) | Luozheng Qin, Jia Gong, Qian Qiao, Tianjiao Li, Li Xu, Haoyu Pan, Chao Qu, Zhiyu Tan, Hao Li | 10.02
- Text-Vision Co-Instructed Image Editing (2026) | Chenxi Xie et al. | 9.90
- VisGym: Diverse, Customizable, Scalable Environments for Multimodal Agents (2026) | Zirui Wang et al. | 9.71
- SkinFlow: Efficient Information Transmission for Open Dermatological Diagnosis via Dynamic Visual Encoding and Staged RL (2026) | Lijun Liu et al. | 9.65
- GutenOCR: A Grounded Vision-Language Front-End for Documents (2026) | Hunter Heidenreich et al. | 9.54
- On Token's Dilemma: Dynamic MoE with Drift-Aware Token Assignment for Continual Learning of Large Vision Language Models (2026) | Chongyang Zhao et al. | 9.54
- OmniVec2 -- A Novel Transformer Based Network For Large Scale Multimodal And Multitask Learning (2025) | Siddharth Srivastava, Gaurav Sharma | 9.46
- Efficient Multimodal Large Language Models: A Survey (2024) | Yizhang Jin et al. | 9.43
- Rethinking UMM Visual Generation: Masked Modeling for Efficient Image-Only Pre-training (2026) | Peng Sun et al. | 9.41
- UniDDT: Unifying Multimodal Understanding and Generation with Decoupled Diffusion Transformer (2026) | Shuai Wang et al. | 9.34
- Pushupbench: Your VLM Is Not Good At Counting Pushups (2026) | Shengzhi Li, Jiarun Chen, Karun Sharma, et al. | 9.22
- RepWAM: World Action Modeling with Representation Visual-Action Tokenizers (2026) | Junke Wang et al. | 9.19
- VINO: A Unified Visual Generator with Interleaved OmniModal Context (2026) | Junyi Chen et al. | 9.10
- TACFN: Transformer-based Adaptive Cross-modal Fusion Network For Multimodal Emotion Recognition (2025) | Feng Liu, Ziwang Fu, Yunlong Wang, et al. | 9.10
- X-SAM: From Segment Anything to Any Segmentation (2025) | Hao Wang et al. | 9.02
- Multimodal Fake News Detection: MFND Dataset And Shallow-deep Multitask Learning (2025) | Ye Zhu, Yunan Wang, Zitong Yu | 9.01
- DualToken: Towards Unifying Visual Understanding and Generation with Dual Visual Vocabularies (2025) | Wei Song et al. | 9.00
- OS-Symphony: A Holistic Framework for Robust and Generalist Computer-Using Agent (2026) | Bowen Yang et al. | 8.96
- Waveformer: Frequency-time Decoupled Vision Modeling With Wave Equation (2026) | Zishan Shu, Juntong Wu, Wei Yan, et al. | 8.91
- RepFusion: Leveraging Multimodal Priors for Denoising in Representation Space (2026) | Xichen Pan et al. | 8.85
- OpenVoxel: Training-Free Grouping and Captioning Voxels for Open-Vocabulary 3D Scene Understanding (2026) | Sheng-Yu Huang et al. | 8.81
- A Safety Report on GPT-5.2, Gemini 3 Pro, Qwen3-VL, Grok 4.1 Fast, Nano Banana Pro, and Seedream 4.5 (2026) | Xingjun Ma et al. | 8.81
- Vision-language-action Models For Robotics: A Review Towards Real-world Applications (2025) | Kento Kawaharazuka, Jihoon Oh, Jun Yamada, et al. | 8.65
- RoboVIP: Multi-View Video Generation with Visual Identity Prompting Augments Robot Manipulation (2026) | Boyang Wang et al. | 8.64
- AdaptCLIP: Adapting CLIP For Universal Visual Anomaly Detection (2025) | Bin-Bin Gao, Yue Zhou, Jiangtao Yan, et al. | 8.59
- Stateful Visual Encoders for Vision-Language Models (2026) | Zirui Wang et al. | 8.57
- Information-theoretic Graph Fusion With Vision-language-action Model For Policy Reasoning And Dual Robotic Control (2025) | Shunlei Li, Longsen Gao, Jin Wang, et al. | 8.55
- Speaking the Language of Science: Toward a General-Purpose Generative Foundation Model for the Natural Sciences (2026) | Mingyang Li et al. | 8.49
- Open-Vocabulary Octree-Graph for 3D Scene Understanding (2024) | Zhigang Wang et al. | 8.44
- Small Vision-Language Models are Smart Compressors for Long Video Understanding (2026) | Junjie Fei et al. | 8.43
- MedGemma Technical Report (2025) | Andrew Sellergren, Sahar Kazemzadeh, Tiam Jaroensri, et al. | 8.35
- VLA-JEPA: Enhancing Vision-Language-Action Model with Latent World Model (2026) | Jingwen Sun et al. | 8.22
- Seeing Isn't Knowing: Do VLMs Know When Not to Answer Spatial Questions (and Why)? (2026) | Yue Zhang et al. | 8.18
- Towards Pixel-Level VLM Perception via Simple Points Prediction (2026) | Tianhui Song et al. | 8.16
- ThinkJEPA: Empowering Latent World Models with Large Vision-Language Reasoning Model (2026) | Haichao Zhang et al. | 8.16