Awesome Multimodal

Video-Language

© 2026 Awesome Papers.

Awesome Video-Language — curated papers, datasets & benchmarks · Awesome Multimodal

Awesome Video-Language

Video-Language is one of the most active areas in Awesome Multimodal, with 5,115 papers in this collection evaluated on datasets such as LIBERO, VQA, and MSCOCO. A strong starting point is "ABot-World-0: Infinite Interactive World Rollout on a Single Desktop GPU".

Datasets & benchmarks

LIBERO · 57 papers

MSCOCO · 31 papers

VideoMME · 28 papers

MSR-VTT · 27 papers

nuScenes · 27 papers

NExT-QA · 21 papers

REVERIE · 20 papers

MMBench · 20 papers

Key papers

60 papers · trending (default) · numbers = 🔥 heat

ABot-World-0: Infinite Interactive World Rollout on a Single Desktop GPU (2026)
Fan Jiang et al.
17.67
Orca: The World is in Your Mind (2026)
Yihao Wang et al.
14.63
LabVLA: Grounding Vision-Language-Action Models in Scientific Laboratories (2026)
Baochang Ren et al.
13.47
Urban Socio-Semantic Segmentation with Vision-Language Reasoning (2026)
Yu Wang et al.
12.47
Sports-QA: A Large-Scale Video Question Answering Benchmark for Complex and Professional Sports (2024)
Haopeng Li et al.
12.40
VidVec: Unlocking Video MLLM Embeddings for Video-Text Retrieval (2026)
Issar Tzachor et al.
12.05
When Numbers Speak: Aligning Textual Numerals and Visual Instances in Text-to-Video Diffusion Models (2026)
Zhengyang Sun et al.
12.00
PerceptionDLM: Parallel Region Perception with Multimodal Diffusion Language Models (2026)
Yueyi Sun et al.
11.97
M$^3$Eval: Multi-Modal Memory Evaluation through Cognitively-Grounded Video Tasks (2026)
Jie Huang et al.
11.36
VisionZip: Longer is Better but Not Necessary in Vision Language Models (2024)
Senqiao Yang et al.
11.27
VisualClaw: A Real-Time, Personalized Agent for the Physical World (2026)
Haoqin Tu et al.
11.11
Qwen-RobotWorld Technical Report: Unifying Embodied World Modeling through Language-Conditioned Video Generation (2026)
Jie Zhang et al.
11.02
NextFlow: Unified Sequential Modeling Activates Multimodal Understanding and Generation (2026)
Huichao Zhang et al.
10.55
OmniVideo-100K: A Dataset for Audio-Visual Reasoning through Structured Scripts and Evidence Chains (2026)
Xinyue Cai et al.
10.48
Cross-Modal Retrieval for Motion and Text via DropTriple Loss (2023)
Sheng Yan et al.
10.35
The Script is All You Need: An Agentic Framework for Long-Horizon Dialogue-to-Cinematic Video Generation (2026)
Chenyu Mu et al.
10.30
MM-Zero: Self-Evolving Multi-Model Vision Language Models From Zero Data (2026)
Zongxia Li et al.
10.30
Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning (2026)
Chi-Pin Huang et al.
10.26
RANKVIDEO: Reasoning Reranking for Text-to-Video Retrieval (2026)
Tyler Skow et al.
10.25
AURA: Always-On Understanding and Real-Time Assistance via Video Streams (2026)
Xudong Lu et al.
10.23
Dense Video Captioning Using Graph-based Sentence Summarization (2025)
Zhiwang Zhang, Dong Xu, Wanli Ouyang, et al.
10.19
PEEK: Picking Essential frames via Efficient Knowledge distillation (2026)
Killian Steunou et al.
10.18
VideoDetective: Clue Hunting via both Extrinsic Query and Intrinsic Relevance for Long Video Understanding (2026)
Ruoliu Yang et al.
10.13
Bridging VideoQA and Video-Guided Agentic Tasks via Generalized Keyframe Extraction (2026)
Sunqi Fan et al.
10.13
Human Motion Video Generation: A Survey (2025)
Haiwei Xue, Xiangyang Luo, Zhanghao Hu, et al.
10.08
Uni-ViGU: Towards Unified Video Generation and Understanding via A Diffusion-Based Video Generator (2026)
Luozheng Qin, Jia Gong, Qian Qiao, Tianjiao Li, Li Xu, Haoyu Pan, Chao Qu, Zhiyu Tan, Hao Li
9.91
HumanScale: Egocentric Human Video Can Outperform Real-Robot Data for Embodied Pretraining (2026)
Juncheng Ma et al.
9.62
Watch Before You Answer: Learning from Visually Grounded Post-Training (2026)
Yuxuan Zhang et al.
9.54
Show, Tell And Summarize: Dense Video Captioning Using Visual Cue Aided Sentence Summarization (2025)
Zhiwang Zhang, Dong Xu, Wanli Ouyang, et al.
9.54
Native Active Perception as Reasoning for Omni-Modal Understanding (2026)
Zhenghao Xing et al.
9.48
Vision-language-action Models For Robotics: A Review Towards Real-world Applications (2025)
Kento Kawaharazuka, Jihoon Oh, Jun Yamada, et al.
9.48
GutenOCR: A Grounded Vision-Language Front-End for Documents (2026)
Hunter Heidenreich et al.
9.43
On Token's Dilemma: Dynamic MoE with Drift-Aware Token Assignment for Continual Learning of Large Vision Language Models (2026)
Chongyang Zhao et al.
9.43
VideoAuto-R1: Video Auto Reasoning via Thinking Once, Answering Twice (2026)
Shuming Liu et al.
9.37
OmniVec2: A Novel Transformer Based Network For Large Scale Multimodal And Multitask Learning (2025)
Siddharth Srivastava, Gaurav Sharma
9.34
VINO: A Unified Visual Generator with Interleaved OmniModal Context (2026)
Junyi Chen et al.
8.99
CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models (2025)
Qingqing Zhao et al.
8.89
Go To Zero: Towards Zero-shot Motion Generation With Million-scale Data (2025)
Ke Fan, Shunlin Lu, Minyue Dai, et al.
8.86
Information-theoretic Graph Fusion With Vision-language-action Model For Policy Reasoning And Dual Robotic Control (2025)
Shunlei Li, Longsen Gao, Jin Wang, et al.
8.81
Audio-Visual Flamingo: Open Audio-Visual Intelligence for Long and Complex Videos (2026)
Sreyan Ghosh et al.
8.80
RoboVIP: Multi-View Video Generation with Visual Identity Prompting Augments Robot Manipulation (2026)
Boyang Wang et al.
8.53
Stateful Visual Encoders for Vision-Language Models (2026)
Zirui Wang et al.
8.46
VLS: Steering Pretrained Robot Policies via Vision-Language Models (2026)
Shuo Liu et al.
8.41
Small Vision-Language Models are Smart Compressors for Long Video Understanding (2026)
Junjie Fei et al.
8.32
VLA-JEPA: Enhancing Vision-Language-Action Model with Latent World Model (2026)
Jingwen Sun et al.
8.11
Future Optical Flow Prediction Improves Robot Control & Video Generation (2026)
Kanchana Ranasinghe et al.
8.05
ThinkJEPA: Empowering Latent World Models with Large Vision-Language Reasoning Model (2026)
Haichao Zhang et al.
8.05
Vision-language Modeling Meets Remote Sensing: Models, Datasets And Perspectives (2025)
Xingxing Weng, Chao Pang, Gui-Song Xia
7.88
Seeing Is Not Sharing: Some Vision-Language Models Overestimate Common Ground in Asymmetric Dialogue (2026)
Nan Li et al.
7.73
Growing a Multi-head Twig via Distillation and Reinforcement Learning to Accelerate Large Vision-Language Models (2025)
Zhenwei Shao et al.
7.71
What Users Leave Unsaid: Under-Specified Queries Limit Vision-Language Models (2026)
Dasol Choi et al.
7.70
DiscoVLA: Discrepancy Reduction In Vision, Language, And Alignment For Parameter-efficient Video-text Retrieval (2025)
Leqi Shen, Guoqiang Gong, Tianxiang Hao, et al.
7.63
Attention-based Transformer Models For Image Captioning Across Languages: An In-depth Survey And Evaluation (2025)
Israa A. Albadarneh, Bassam H. Hammo, Omar S. Al-Kadi
7.61
Object Detection With Multimodal Large Vision-language Models: An In-depth Review (2025)
Ranjan Sapkota, Manoj Karkee
7.61
Typhoon OCR: Open Vision-Language Model For Thai Document Extraction (2026)
Surapon Nonesung et al.
7.57
Thinking in Frames: How Visual Context and Test-Time Scaling Empower Video Reasoning (2026)
Chengzu Li et al.
7.57
Dream-VL & Dream-VLA: Open Vision-Language and Vision-Language-Action Models with Diffusion Language Model Backbone (2025)
Jiacheng Ye et al.
7.53
Qwen3-VL Technical Report (2025)
Shuai Bai et al.
7.41
Loc3R-VLM: Language-based Localization and 3D Reasoning with Vision-Language Models (2026)
Kevin Qu et al.
7.40
MCA-LLaVA: Manhattan Causal Attention For Reducing Hallucination In Large Vision-language Models (2025)
Qiyan Zhao, Xiaofeng Zhang, Yiheng Li, et al.
7.37