Awesome Video-Language
Video-Language is one of the most active areas in Awesome Multimodal, with 4,446 papers in this collection, evaluated on datasets such as LIBERO, nuScenes, and Video-MME. A strong starting point is "DreamX-World 1.0: A General-Purpose Interactive World Model".
Datasets & benchmarks
Key papers
- DreamX-World 1.0: A General-Purpose Interactive World Model (2026) | DreamX Team et al. | 15.33
- LabVLA: Grounding Vision-Language-Action Models in Scientific Laboratories (2026) | Baochang Ren et al. | 13.59
- Urban Socio-Semantic Segmentation with Vision-Language Reasoning (2026) | Yu Wang et al. | 12.58
- VidVec: Unlocking Video MLLM Embeddings for Video-Text Retrieval (2026) | Issar Tzachor et al. | 12.16
- Who Should Lead Decoding Now? Tracking Reliable Trajectories for Ensembling Masked Diffusion Language Models (2026) | Heecheol Yun et al. | 11.70
- M$^3$Eval: Multi-Modal Memory Evaluation through Cognitively-Grounded Video Tasks (2026) | Jie Huang et al. | 11.47
- VisionZip: Longer is Better but Not Necessary in Vision Language Models (2024) | Senqiao Yang et al. | 11.27
- VisualClaw: A Real-Time, Personalized Agent for the Physical World (2026) | Haoqin Tu et al. | 11.22
- Qwen-RobotWorld Technical Report: Unifying Embodied World Modeling through Language-Conditioned Video Generation (2026) | Jie Zhang et al. | 11.03
- NextFlow: Unified Sequential Modeling Activates Multimodal Understanding and Generation (2026) | Huichao Zhang et al. | 10.66
- OmniVideo-100K: A Dataset for Audio-Visual Reasoning through Structured Scripts and Evidence Chains (2026) | Xinyue Cai et al. | 10.59
- The Script is All You Need: An Agentic Framework for Long-Horizon Dialogue-to-Cinematic Video Generation (2026) | Chenyu Mu et al. | 10.41
- MM-Zero: Self-Evolving Multi-Model Vision Language Models From Zero Data (2026) | Zongxia Li et al. | 10.41
- Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning (2026) | Chi-Pin Huang et al. | 10.38
- Cross-Modal Retrieval for Motion and Text via DropTriple Loss (2023) | Sheng Yan et al. | 10.35
- Dense Video Captioning Using Graph-based Sentence Summarization (2025) | Zhiwang Zhang, Dong Xu, Wanli Ouyang, et al. | 10.30
- PEEK: Picking Essential frames via Efficient Knowledge distillation (2026) | Killian Steunou et al. | 10.29
- VideoDetective: Clue Hunting via both Extrinsic Query and Intrinsic Relevance for Long Video Understanding (2026) | Ruoliu Yang et al. | 10.24
- Human Motion Video Generation: A Survey (2025) | Haiwei Xue, Xiangyang Luo, Zhanghao Hu, et al. | 10.19
- Uni-ViGU: Towards Unified Video Generation and Understanding via A Diffusion-Based Video Generator (2026) | Luozheng Qin, Jia Gong, Qian Qiao, Tianjiao Li, Li Xu, Haoyu Pan, Chao Qu, Zhiyu Tan, Hao Li | 10.02
- Watch Before You Answer: Learning from Visually Grounded Post-Training (2026) | Yuxuan Zhang et al. | 9.65
- Show, Tell And Summarize: Dense Video Captioning Using Visual Cue Aided Sentence Summarization (2025) | Zhiwang Zhang, Dong Xu, Wanli Ouyang, et al. | 9.65
- GutenOCR: A Grounded Vision-Language Front-End for Documents (2026) | Hunter Heidenreich et al. | 9.54
- On Token's Dilemma: Dynamic MoE with Drift-Aware Token Assignment for Continual Learning of Large Vision Language Models (2026) | Chongyang Zhao et al. | 9.54
- VideoAuto-R1: Video Auto Reasoning via Thinking Once, Answering Twice (2026) | Shuming Liu et al. | 9.48
- Omnivec2 -- A Novel Transformer Based Network For Large Scale Multimodal And Multitask Learning (2025) | Siddharth Srivastava, Gaurav Sharma | 9.46
- Pushupbench: Your VLM Is Not Good At Counting Pushups (2026) | Shengzhi Li, Jiarun Chen, Karun Sharma, et al. | 9.22
- VINO: A Unified Visual Generator with Interleaved OmniModal Context (2026) | Junyi Chen et al. | 9.10
- Memento: Reconstruct to Remember for Consistent Long Video Generation (2026) | Xuan Wei et al. | 9.02
- Go To Zero: Towards Zero-shot Motion Generation With Million-scale Data (2025) | Ke Fan, Shunlin Lu, Minyue Dai, et al. | 8.97
- Vision-language-action Models For Robotics: A Review Towards Real-world Applications (2025) | Kento Kawaharazuka, Jihoon Oh, Jun Yamada, et al. | 8.65
- RoboVIP: Multi-View Video Generation with Visual Identity Prompting Augments Robot Manipulation (2026) | Boyang Wang et al. | 8.64
- Stateful Visual Encoders for Vision-Language Models (2026) | Zirui Wang et al. | 8.57
- Information-theoretic Graph Fusion With Vision-language-action Model For Policy Reasoning And Dual Robotic Control (2025) | Shunlei Li, Longsen Gao, Jin Wang, et al. | 8.55
- VLS: Steering Pretrained Robot Policies via Vision-Language Models (2026) | Shuo Liu et al. | 8.52
- Speaking the Language of Science: Toward a General-Purpose Generative Foundation Model for the Natural Sciences (2026) | Mingyang Li et al. | 8.49
- Small Vision-Language Models are Smart Compressors for Long Video Understanding (2026) | Junjie Fei et al. | 8.43
- VLA-JEPA: Enhancing Vision-Language-Action Model with Latent World Model (2026) | Jingwen Sun et al. | 8.22
- ThinkJEPA: Empowering Latent World Models with Large Vision-Language Reasoning Model (2026) | Haichao Zhang et al. | 8.16
- RANKVIDEO: Reasoning Reranking for Text-to-Video Retrieval (2026) | Tyler Skow et al. | 8.11
- Vision-language Modeling Meets Remote Sensing: Models, Datasets And Perspectives (2025) | Xingxing Weng, Chao Pang, Gui-Song Xia | 7.99
- MVEB: Massive Video Embedding Benchmark (2026) | Adnan El Assadi et al. | 7.98
- Growing a Multi-head Twig via Distillation and Reinforcement Learning to Accelerate Large Vision-Language Models (2025) | Zhenwei Shao et al. | 7.82
- What Users Leave Unsaid: Under-Specified Queries Limit Vision-Language Models (2026) | Dasol Choi et al. | 7.81
- Typhoon OCR: Open Vision-Language Model For Thai Document Extraction (2026) | Surapon Nonesung et al. | 7.68
- Thinking in Frames: How Visual Context and Test-Time Scaling Empower Video Reasoning (2026) | Chengzu Li et al. | 7.68
- Dream-VL & Dream-VLA: Open Vision-Language and Vision-Language-Action Models with Diffusion Language Model Backbone (2025) | Jiacheng Ye et al. | 7.64
- Loc3R-VLM: Language-based Localization and 3D Reasoning with Vision-Language Models (2026) | Kevin Qu et al. | 7.51
- BEAR: Benchmarking and Enhancing Multimodal Language Models for Atomic Embodied Capabilities (2025) | Yu Qi et al. | 7.41
- CombatVLA: An Efficient Vision-Language-Action Model for Combat Tasks in 3D Action Role-Playing Games (2025) | Peng Chen et al. | 7.24
- Attention-based Transformer Models For Image Captioning Across Languages: An In-depth Survey And Evaluation (2025) | Israa A. Albadarneh, Bassam H. Hammo, Omar S. Al-Kadi | 7.24
- Object Detection With Multimodal Large Vision-language Models: An In-depth Review (2025) | Ranjan Sapkota, Manoj Karkee | 7.24
- Towards Vision-Language Geo-Foundation Model: A Survey (2024) | Yue Zhou et al. | 7.00
- SAGE: A Visual Language Model For Anomaly Detection Via Fact Enhancement And Entropy-aware Alignment (2025) | Guoxin Zang, Xue Li, Donglin di, et al. | 6.99
- TruthPrInt: Mitigating Large Vision-Language Models Object Hallucination Via Latent Truthful-Guided Pre-Intervention (2025) | Jinhao Duan et al. | 6.89
- Hulu-med: A Transparent Generalist Model Towards Holistic Medical Vision-language Understanding (2025) | Songtao Jiang, Yuan Wang, Sibo Song, et al. | 6.87
- Disasterm3: A Remote Sensing Vision-language Dataset For Disaster Damage Assessment And Response (2025) | Junjue Wang, Weihao Xuan, Heli Qi, et al. | 6.81
- ViDiC: Video Difference Captioning (2025) | Jiangtao Wu et al. | 6.79
- Multi-Modal Cross-Domain Alignment Network for Video Moment Retrieval (2022) | Xiang Fang et al. | 6.77
- Vision-Language and Large Language Model Performance in Gastroenterology: GPT, Claude, Llama, Phi, Mistral, Gemma, and Quantized Models (2024) | Seyed Amir Ahmad Safavi-Naini et al. | 6.67