Awesome Instruction Tuning
Instruction Tuning is one of the most active areas in Awesome Multimodal, with 806 papers in this collection evaluated on datasets such as MME, LIBERO, and R-2R. A strong starting point is "InterleaveThinker: Reinforcing Agentic Interleaved Generation".
Key papers
- InterleaveThinker: Reinforcing Agentic Interleaved Generation (2026) · Dian Zheng et al. · 14.38
- DreamX-World 1.0: A General-Purpose Interactive World Model (2026) · DreamX Team et al. · 14.04
- LabVLA: Grounding Vision-Language-Action Models in Scientific Laboratories (2026) · Baochang Ren et al. · 13.59
- OmniVideo-100K: A Dataset for Audio-Visual Reasoning through Structured Scripts and Evidence Chains (2026) · Xinyue Cai et al. · 10.59
- On Token's Dilemma: Dynamic MoE with Drift-Aware Token Assignment for Continual Learning of Large Vision Language Models (2026) · Chongyang Zhao et al. · 9.54
- Rethinking UMM Visual Generation: Masked Modeling for Efficient Image-Only Pre-training (2026) · Peng Sun et al. · 9.41
- RepWAM: World Action Modeling with Representation Visual-Action Tokenizers (2026) · Junke Wang et al. · 9.19
- VINO: A Unified Visual Generator with Interleaved OmniModal Context (2026) · Junyi Chen et al. · 9.10
- Vision-language Modeling Meets Remote Sensing: Models, Datasets And Perspectives (2025) · Xingxing Weng, Chao Pang, Gui-Song Xia · 7.99
- MathCoder-VL: Bridging Vision And Code For Enhanced Multimodal Mathematical Reasoning (2025) · Ke Wang, Junting Pan, Linda Wei, et al. · 7.44
- QoQ-Med: Building Multimodal Clinical Foundation Models With Domain-aware GRPO Training (2025) · Wei Dai, Peilin Chen, Chanakya Ekbote, et al. · 6.95
- MM-LIMA: Less Is More for Alignment in Multi-Modal Datasets (2023) · Lai Wei et al. · 6.88
- ShapeLLM-Omni: A Native Multimodal LLM For 3D Generation And Understanding (2025) · Junliang Ye, Zhengyi Wang, Ruowen Zhao, et al. · 6.82
- Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling (2025) · Xiaokang Chen et al. · 6.58
- OmniGen2: Towards Instruction-aligned Multimodal Generation (2025) · Chenyuan Wu, Pengfei Zheng, Ruiran Yan, et al. · 6.44
- PathReasoner-R1: Instilling Structured Reasoning into Pathology Vision-Language Model via Knowledge-Guided Policy Optimization (2026) · Songhan Jiang et al. · 6.34
- Medical Knowledge Intervention Prompt Tuning For Medical Image Classification (2025) · Ye Du, Nanxi Yu, Shujun Wang · 6.23
- MCA-LLaVA: Manhattan Causal Attention For Reducing Hallucination In Large Vision-language Models (2025) · Qiyan Zhao, Xiaofeng Zhang, Yiheng Li, et al. · 6.19
- MIMO: A Medical Vision Language Model With Visual Referring Multimodal Input And Pixel Grounding Multimodal Output (2025) · Yanyuan Chen, Dexuan Xu, Yu Huang, et al. · 6.12
- VQualA 2025 Challenge On Visual Quality Comparison For Large Multimodal Models: Methods And Results (2025) · Hanwei Zhu, Haoning Wu, Zicheng Zhang, et al. · 6.12
- InfiGUI-G1: Advancing GUI Grounding With Adaptive Exploration Policy Optimization (2025) · Yuhang Liu, Zeyu Liu, Shuanghe Zhu, et al. · 5.98
- Hierarchical-task-aware Multi-modal Mixture Of Incremental LoRA Experts For Embodied Continual Learning (2025) · Ziqi Jia, Anmin Wang, Xiaoyang Qu, et al. · 5.82
- NoteIt: A System Converting Instructional Videos To Interactable Notes Through Multimodal Video Understanding (2025) · Running Zhao, Zhihan Jiang, Xinchen Zhang, et al. · 5.46
- ForgeryGPT: A Multimodal LLM for Interpretable Image Forgery Detection and Localization (2024) · Fanrui Zhang et al. · 5.40
- EmoVLM-KD: Fusing Distilled Expertise With Vision-language Models For Visual Emotion Analysis (2025) · Sangeun Lee, Yubeen Lee, Eunil Park · 5.04
- Scaling Up Biomedical Vision-language Models: Fine-tuning, Instruction Tuning, And Multi-modal Learning (2025) · Cheng Peng, Kai Zhang, Mengxian Lyu, et al. · 5.04
- Trajectory-Level Redirection Attacks on Vision-Language-Action Models (2026) · Gokul Puthumanaillam et al. · 5.01
- S1-DeepResearch: Beyond Search, Toward Real-World Long-Horizon Research Agents (2026) · Yao Dong et al. · 5.01
- Understanding the Behaviors of Environment-aware Information Retrieval (2026) · Ruifeng Yuan et al. · 5.01
- The Thinking Boundary: Quantifying Reasoning Suitability of Multimodal Tasks via Dual Tuning (2026) · Ruobing Zheng et al. · 4.81
- Adaptive Markup Language Generation For Contextually-grounded Visual Document Understanding (2025) · Han Xiao, Yina Xie, Guanxin Tan, et al. · 4.67
- Multi-modal Multi-task (M3T) Federated Foundation Models For Embodied AI: Potentials And Challenges For Edge Integration (2025) · Kasra Borazjani, Payam Abdisarabshali, Fardis Nadimi, et al. · 4.53
- ShizhenGPT: Towards Multimodal LLMs For Traditional Chinese Medicine (2025) · Junying Chen, Zhenyang Cai, Zhiheng Liu, et al. · 4.53
- Collaborative Multi-LoRA Experts With Achievement-based Multi-tasks Loss For Unified Multimodal Information Extraction (2025) · Li Yuan, Yi Cai, Xudong Shen, et al. · 4.53
- MC-LLaVA: Multi-Concept Personalized Vision-Language Model (2024) · Ruichuan An et al. · 4.52
- Continual Learning For Generative AI: From LLMs To MLLMs And Beyond (2025) · Haiyang Guo, Fanhu Zeng, Fei Zhu, et al. · 4.40
- VCIFBench: Evaluating Complex Instruction Following for Video Understanding (2026) · Huangchen Xu et al. · 4.39
- BreastGPT: A Multimodal Large Language Model for the Full Spectrum of Breast Cancer Clinical Routine (2026) · Yang Liu et al. · 4.39
- Plan, Watch, Recover: A Benchmark and Architectures for Proactive Procedural Assistance (2026) · Kaustav Kundu et al. · 4.39
- Food-R1: A Unified Multi-Task Food Vision-Language Model with Reinforcement Learning (2026) · Yu Zhu et al. · 4.39
- DC-Motion: Decoupling Semantics and Details via Discrete-Continuous Tokens for Human Motion Generation (2026) · Hequan Wang et al. · 4.39
- An Extensive Benchmark for Single-round and Multi-round Instruction-based Image Editing (2026) · Yiwei Ma et al. · 4.39
- Do LLMs Reliably Identify Correct Information Units in Aphasic Discourse? (2026) · Jason M Pittman et al. · 4.39
- Beyond English: Uncovering the Multilingual Gap in Vision-Language-Action Models (2026) · Hanyang Chen et al. · 4.39
- Vernier: Probing Representational Misalignment Behind Lexical Gaps in Causal Reasoning (2026) · Zhenyu Yu · 4.39
- Task-Instructed Causal Routing of Vision Foundation Models for Multi-Task Learning (2026) · Donghyun Han et al. · 4.39
- The Truth Stays in the Family: Enhancing Contextual Grounding via Inherited Truthful Heads in Model Lineages (2026) · Miso Choi et al. · 4.39
- SciOrch: Learning to Orchestrate Expert LLMs for Solving Frontier Multimodal Scientific Reasoning Tasks (2026) · Jingru Guo et al. · 4.39
- Text-Vision Co-Instructed Image Editing (2026) · Chenxi Xie et al. · 4.39
- Redirecting the Flow: Image Customization through Attention Distribution Shift (2026) · Jie Li et al. · 4.39
- The Value Axis: Language Models Encode Whether They're on the Right Track (2026) · Nick Jiang et al. · 4.39
- MindOmni: Unleashing Reasoning Generation In Vision Language Models With RGPO (2025) · Yicheng Xiao, Lin Song, Yukang Chen, et al. · 4.37
- A Pilot Study on Curator-Guided Multilingual Art Description for Blind and Low-Vision Audiences with Small Vision-Language Models (2026) · Iosif Tsangko et al. · 4.33
- LENS: Learning To Segment Anything With Unified Reinforced Reasoning (2025) · Lianghui Zhu, Bin Ouyang, Yuxuan Zhang, et al. · 4.30
- LLaVA-Pose: Enhancing Human Pose And Action Understanding Via Keypoint-integrated Instruction Tuning (2025) · Dewen Zhang, Tahir Hussain, Wangpeng An, et al. · 4.28
- Vision-language Models On The Edge For Real-time Robotic Perception (2026) · Sarat Ahmad, Maryam Hafeez, Syed Ali Raza Zaidi · 4.26
- When Generative AI Meets Extended Reality: Enabling Scalable And Natural Interactions (2026) · Mingyu Zhu, Jiangong Chen, Bin Li · 4.26
- MLLM-4D: Towards Visual-based Spatial-Temporal Intelligence (2026) · Xingyilang Yin et al. · 4.20
- Global Context Compression with Interleaved Vision-Text Transformation (2026) · Dian Jiao et al. · 4.08
- DermoGPT: Open Weights And Open Data For Morphology-grounded Dermatological Reasoning MLLMs (2026) · Jinghan Ru, Siyuan Yan, Yuguo Yin, et al. · 4.08