Awesome Multimodal

© 2026 Awesome Papers.

Awesome Instruction Tuning — curated papers, datasets & benchmarks · Awesome Multimodal

Awesome Instruction Tuning

Instruction Tuning is one of the most active areas in Awesome Multimodal, with 864 papers in this collection evaluated on datasets such as MME, ScienceQA, and MIMIC-CXR. A strong starting point is "Boogu-Image-0.1: Boosting Open-Source Unified Multimodal Understanding and Generation".

Datasets & benchmarks

ScienceQA: 4 papers

MIMIC-CXR: 4 papers

nuScenes: 3 papers

MMBench: 3 papers

LLaVA-665K: 3 papers

GenEval: 3 papers

Key papers

60 papers · trending (default) · numbers = 🔥 heat

Boogu-Image-0.1: Boosting Open-Source Unified Multimodal Understanding and Generation (2026)
Guoxuan Chen et al.
14.86
InterleaveThinker: Reinforcing Agentic Interleaved Generation (2026)
Dian Zheng et al.
14.27
LabVLA: Grounding Vision-Language-Action Models in Scientific Laboratories (2026)
Baochang Ren et al.
13.47
Vision as Unified Multimodal Generation (2026)
Xiaoyang Han et al.
12.51
OmniVideo-100K: A Dataset for Audio-Visual Reasoning through Structured Scripts and Evidence Chains (2026)
Xinyue Cai et al.
10.48
Text-Vision Co-Instructed Image Editing (2026)
Chenxi Xie et al.
9.79
On Token's Dilemma: Dynamic MoE with Drift-Aware Token Assignment for Continual Learning of Large Vision Language Models (2026)
Chongyang Zhao et al.
9.43
Rethinking UMM Visual Generation: Masked Modeling for Efficient Image-Only Pre-training (2026)
Peng Sun et al.
9.30
VINO: A Unified Visual Generator with Interleaved OmniModal Context (2026)
Junyi Chen et al.
8.99
Vision-language Modeling Meets Remote Sensing: Models, Datasets And Perspectives (2025)
Xingxing Weng, Chao Pang, Gui-Song Xia
7.88
MCA-LLaVA: Manhattan Causal Attention For Reducing Hallucination In Large Vision-language Models (2025)
Qiyan Zhao, Xiaofeng Zhang, Yiheng Li, et al.
7.37
MathCoder-VL: Bridging Vision And Code For Enhanced Multimodal Mathematical Reasoning (2025)
Ke Wang, Junting Pan, Linda Wei, et al.
7.33
MM-LIMA: Less Is More for Alignment in Multi-Modal Datasets (2023)
Lai Wei et al.
6.88
QoQ-Med: Building Multimodal Clinical Foundation Models With Domain-aware GRPO Training (2025)
Wei Dai, Peilin Chen, Chanakya Ekbote, et al.
6.84
ShapeLLM-Omni: A Native Multimodal LLM For 3D Generation And Understanding (2025)
Junliang Ye, Zhengyi Wang, Ruowen Zhao, et al.
6.71
Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling (2025)
Xiaokang Chen et al.
6.47
OmniGen2: Towards Instruction-aligned Multimodal Generation (2025)
Chenyuan Wu, Pengfei Zheng, Ruiran Yan, et al.
6.32
PathReasoner-R1: Instilling Structured Reasoning into Pathology Vision-Language Model via Knowledge-Guided Policy Optimization (2026)
Songhan Jiang et al.
6.23
Medical Knowledge Intervention Prompt Tuning For Medical Image Classification (2025)
Ye Du, Nanxi Yu, Shujun Wang
6.12
MIMO: A Medical Vision Language Model With Visual Referring Multimodal Input And Pixel Grounding Multimodal Output (2025)
Yanyuan Chen, Dexuan Xu, Yu Huang, et al.
6.01
VQualA 2025 Challenge On Visual Quality Comparison For Large Multimodal Models: Methods And Results (2025)
Hanwei Zhu, Haoning Wu, Zicheng Zhang, et al.
6.01
InfiGUI-G1: Advancing GUI Grounding With Adaptive Exploration Policy Optimization (2025)
Yuhang Liu, Zeyu Liu, Shuanghe Zhu, et al.
5.87
Hierarchical-task-aware Multi-modal Mixture Of Incremental LoRA Experts For Embodied Continual Learning (2025)
Ziqi Jia, Anmin Wang, Xiaoyang Qu, et al.
5.70
ForgeryGPT: A Multimodal LLM for Interpretable Image Forgery Detection and Localization (2024)
Fanrui Zhang et al.
5.40
NoteIt: A System Converting Instructional Videos To Interactable Notes Through Multimodal Video Understanding (2025)
Running Zhao, Zhihan Jiang, Xinchen Zhang, et al.
5.35
GeoChrono: Benchmarking and Rethinking Long-Term Temporal Understanding in Remote Sensing (2026)
Yujie Li et al.
4.95
Seeing or Knowing? Visual Context Sensitivity in Multimodal Large Language Models (2026)
Jiaang Li et al.
4.95
TPCD: Tone-Pressure Contrastive Decoding and the Label-Free Gating Bottleneck in Vision-Language Models (2026)
Jinkun Zhao et al.
4.95
Decoupled Visual Processing: Efficient Multimodal Adaptation via Modality-Specific Transformer Substitution (2026)
Mingkuan Feng et al.
4.95
Progressive Multimodal Alignment for Continual Instruction Tuning (2026)
Duzhen Zhang et al.
4.95
EmoVLM-KD: Fusing Distilled Expertise With Vision-language Models For Visual Emotion Analysis (2025)
Sangeun Lee, Yubeen Lee, Eunil Park
4.93
Scaling Up Biomedical Vision-language Models: Fine-tuning, Instruction Tuning, And Multi-modal Learning (2025)
Cheng Peng, Kai Zhang, Mengxian Lyu, et al.
4.93
Trajectory-Level Redirection Attacks on Vision-Language-Action Models (2026)
Gokul Puthumanaillam et al.
4.90
Show or Tell? Effectively prompting Vision-Language Models for semantic segmentation (2025)
Niccolo Avogaro et al.
4.76
The Thinking Boundary: Quantifying Reasoning Suitability of Multimodal Tasks via Dual Tuning (2026)
Ruobing Zheng et al.
4.70
Adaptive Markup Language Generation For Contextually-grounded Visual Document Understanding (2025)
Han Xiao, Yina Xie, Guanxin Tan, et al.
4.56
ShizhenGPT: Towards Multimodal LLMs For Traditional Chinese Medicine (2025)
Junying Chen, Zhenyang Cai, Zhiheng Liu, et al.
4.53
MC-LLaVA: Multi-Concept Personalized Vision-Language Model (2024)
Ruichuan An et al.
4.52
Multi-modal Multi-task (M3T) Federated Foundation Models For Embodied AI: Potentials And Challenges For Edge Integration (2025)
Kasra Borazjani, Payam Abdisarabshali, Fardis Nadimi, et al.
4.42
Collaborative Multi-LoRA Experts With Achievement-based Multi-tasks Loss For Unified Multimodal Information Extraction (2025)
Li Yuan, Yi Cai, Xudong Shen, et al.
4.42
WisWheat: A Three-tiered Vision-language Dataset For Wheat Management (2025)
Bowen Yuan, Selena Song, Javier Fernandez, et al.
4.42
Enhancing the Learning Experience: Using Vision-Language Models to Generate Questions for Educational Videos (2025)
Markos Stamatakis et al.
4.36
iFLYTEK-Embodied-Omni Technical Report (2026)
Yuan Zhang et al.
4.33
Attending to Multimodal Generation One Token at a Time (2026)
Varun Gupta et al.
4.33
IKS-Instruct: A 24,000-Example Multilingual Dataset for Teaching Language Models Indian Knowledge Systems (2026)
Shwetha Singaravelu et al.
4.33
ReasonDrive: Efficient Visual Question Answering for Autonomous Vehicles with Reasoning-Enhanced Small Vision-Language Models (2025)
Amirhosein Chahe and Lifeng Zhou
4.30
Continual Learning For Generative AI: From LLMs To MLLMs And Beyond (2025)
Haiyang Guo, Fanhu Zeng, Fei Zhu, et al.
4.29
VCIFBench: Evaluating Complex Instruction Following for Video Understanding (2026)
Huangchen Xu et al.
4.27
BreastGPT: A Multimodal Large Language Model for the Full Spectrum of Breast Cancer Clinical Routine (2026)
Yang Liu et al.
4.27
Plan, Watch, Recover: A Benchmark and Architectures for Proactive Procedural Assistance (2026)
Kaustav Kundu et al.
4.27
Food-R1: A Unified Multi-Task Food Vision-Language Model with Reinforcement Learning (2026)
Yu Zhu et al.
4.27
Beyond English: Uncovering the Multilingual Gap in Vision-Language-Action Models (2026)
Hanyang Chen et al.
4.27
The Truth Stays in the Family: Enhancing Contextual Grounding via Inherited Truthful Heads in Model Lineages (2026)
Miso Choi et al.
4.27
Curvature-Guided Mixing for MLLM Adaptation (2026)
Jinglong Yang et al.
4.27
Yuvion VL: A Multimodal Foundation Model for Adversarial Content and AI Safety (2026)
Shikai Qiu et al.
4.27
SurgAtlas: A Large-Scale Surgical Video-Language Dataset with 2,391 Hours of Open and Minimally Invasive Surgery (2026)
Filippos Bellos et al.
4.27
ALM2Vec: Learning Audio Embeddings for Universal Audio Retrieval with Large Audio-Language Models (2026)
Fengjie Lu et al.
4.27
A Pilot Study on Curator-Guided Multilingual Art Description for Blind and Low-Vision Audiences with Small Vision-Language Models (2026)
Iosif Tsangko et al.
4.22
Personalize Your Large Vision-language Models With In-context Prompt Tuning (2026)
Yanshu Li et al.
4.22
MindOmni: Unleashing Reasoning Generation In Vision Language Models With RGPO (2025)
Yicheng Xiao, Lin Song, Yukang Chen, et al.
4.21