Awesome Vision-Language
Vision-Language is one of the most active areas in Awesome LLM Papers, with 987 papers in this collection evaluated on datasets such as MathVista, GSM8K, and MMLU. A strong starting point is "Training Language Models To Follow Instructions With Human Feedback".
Datasets & benchmarks
Key papers
- Training Language Models To Follow Instructions With Human Feedback (2022). Long Ouyang, Jeff Wu, Xu Jiang, et al. 36.92
- Llama: Open And Efficient Foundation Language Models (2023). Hugo Touvron, Thibaut Lavril, Gautier Izacard, et al. 36.83
- React: Synergizing Reasoning And Acting In Language Models (2022). Shunyu Yao, Jeffrey Zhao, Dian Yu, et al. 36.63
- SPHINX: The Joint Mixing Of Weights, Tasks, And Visual Embeddings For Multi-modal Large Language Models (2023). Ziyi Lin, Chris Liu, Renrui Zhang, et al. 31.35
- The Dawn Of Lmms: Preliminary Explorations With Gpt-4v(ision) (2023). Zhengyuan Yang, Linjie Li, Kevin Lin, et al. 30.34
- Aya Model: An Instruction Finetuned Open-access Multilingual Language Model (2024). Ahmet Üstün, Viraat Aryabumi, Zheng-Xin Yong, et al. 27.57
- Fine-tuning Language Models For Factuality (2023). Katherine Tian, Eric Mitchell, Huaxiu Yao, et al. 25.62
- Eagle: Exploring The Design Space For Multimodal Llms With Mixture Of Encoders (2024). Min Shi, Fuxiao Liu, Shihao Wang, et al. 25.55
- Llamax: Scaling Linguistic Horizons Of LLM By Enhancing Translation Capabilities Beyond 100 Languages (2024). Yinquan Lu, Wenhao Zhu, Lei Li, et al. 25.17
- Layoutllm: Layout Instruction Tuning With Large Language Models For Document Understanding (2024). Chuwei Luo, Yufan Shen, Zhaoqing Zhu, et al. 24.66
- MA-LMM: Memory-augmented Large Multimodal Model For Long-term Video Understanding (2024). Bo He, Hengduo Li, Young Kyun Jang, et al. 24.50
- Imagebind-llm: Multi-modality Instruction Tuning (2023). Jiaming Han, Renrui Zhang, Wenqi Shao, et al. 24.37
- A Systematic Survey Of Prompt Engineering In Large Language Models: Techniques And Applications (2024). Pranab Sahoo, Ayush Kumar Singh, Sriparna Saha, et al. 24.27
- Lm-cocktail: Resilient Tuning Of Language Models Via Model Merging (2023). Shitao Xiao, Zheng Liu, Peitian Zhang, et al. 23.85
- Cobra: Extending Mamba To Multi-modal Large Language Model For Efficient Inference (2024). Han Zhao, Min Zhang, Wei Zhao, et al. 23.43
- You Only Look At Screens: Multimodal Chain-of-action Agents (2023). Zhuosheng Zhang, Aston Zhang. 23.42
- MMC: Advancing Multimodal Chart Understanding With Large-scale Instruction Tuning (2023). Fuxiao Liu, Xiaoyang Wang, Wenlin Yao, et al. 23.28
- SPHINX-X: Scaling Data And Parameters For A Family Of Multi-modal Large Language Models (2024). Dongyang Liu, Renrui Zhang, Longtian Qiu, et al. 23.22
- Llava-phi: Efficient Multi-modal Assistant With Small Language Model (2024). Yichen Zhu, Minjie Zhu, Ning Liu, et al. 22.96
- Visrag: Vision-based Retrieval-augmented Generation On Multi-modality Documents (2024). Shi Yu, Chaoyue Tang, Bokai Xu, et al. 22.87
- Quantifying Language Models' Sensitivity To Spurious Features In Prompt Design Or: How I Learned To Start Worrying About Prompt Formatting (2023). Melanie Sclar, Yejin Choi, Yulia Tsvetkov, et al. 22.72
- Ferret-v2: An Improved Baseline For Referring And Grounding With Large Language Models (2024). Haotian Zhang, Haoxuan You, Philipp Dufter, et al. 22.68
- Mono-internvl: Pushing The Boundaries Of Monolithic Multimodal Large Language Models With Endogenous Visual Pre-training (2024). Gen Luo, Xue Yang, Wenhan Dou, et al. 22.61
- Ovis: Structural Embedding Alignment For Multimodal Large Language Model (2024). Shiyin Lu, Yang Li, Qing-Guo Chen, et al. 22.55
- Mm-safetybench: A Benchmark For Safety Evaluation Of Multimodal Large Language Models (2023). Xin Liu, Yichen Zhu, Jindong Gu, et al. 22.50
- Fine-tuning Multimodal Llms To Follow Zero-shot Demonstrative Instructions (2023). Juncheng Li, Kaihang Pan, Zhiqi Ge, et al. 22.33
- AMBER: An Llm-free Multi-dimensional Benchmark For Mllms Hallucination Evaluation (2023). Junyang Wang, Yuhang Wang, Guohai Xu, et al. 22.17
- Mdpo: Conditional Preference Optimization For Multimodal Large Language Models (2024). Fei Wang, Wenxuan Zhou, James Y. Huang, et al. 22.13
- Next-gpt: Any-to-any Multimodal LLM (2023). Shengqiong Wu, Hao Fei, Leigang Qu, et al. 21.72
- List Items One By One: A New Data Source And Learning Paradigm For Multimodal Llms (2024). An Yan, Zhengyuan Yang, Junda Wu, et al. 21.63
- Eyes Wide Shut? Exploring The Visual Shortcomings Of Multimodal Llms (2024). Shengbang Tong, Zhuang Liu, Yuexiang Zhai, et al. 21.34
- Link-context Learning For Multimodal Llms (2023). Yan Tai, Weichen Fan, Zhao Zhang, et al. 21.31
- Fingpt: Large Generative Models For A Small Language (2023). Risto Luukkonen, Ville Komulainen, Jouni Luoma, et al. 21.27
- PPTC Benchmark: Evaluating Large Language Models For Powerpoint Task Completion (2023). Yiduo Guo, Zekai Zhang, Yaobo Liang, et al. 21.19
- Rethinking Machine Unlearning For Large Language Models (2024). Sijia Liu, Yuanshun Yao, Jinghan Jia, et al. 21.08
- Groundinggpt: Language Enhanced Multi-modal Grounding Model (2024). Zhaowei Li, Qi Xu, Dong Zhang, et al. 20.91
- U-llava: Unifying Multi-modal Tasks Via Large Language Model (2023). Jinjin Xu, Liwu Xu, Yuzhe Yang, et al. 20.67
- Language Models Are Homer Simpson! Safety Re-alignment Of Fine-tuned Language Models Through Task Arithmetic (2024). Rishabh Bhardwaj, Do Duc Anh, Soujanya Poria. 20.43
- Position-enhanced Visual Instruction Tuning For Multimodal Large Language Models (2023). Chi Chen, Ruoyu Qin, Fuwen Luo, et al. 19.90
- Exploring The Role Of Large Language Models In Prompt Encoding For Diffusion Models (2024). Bingqi Ma, Zhuofan Zong, Guanglu Song, et al. 19.55
- LLM Comparator: Visual Analytics For Side-by-side Evaluation Of Large Language Models (2024). Minsuk Kahng, Ian Tenney, Mahima Pushkarna, et al. 19.44
- Frozen Transformers In Language Models Are Effective Visual Encoder Layers (2023). Ziqi Pang, Ziyang Xie, Yunze Man, et al. 19.32
- From Persona To Personalization: A Survey On Role-playing Language Agents (2024). Jiangjie Chen, Xintao Wang, Rui Xu, et al. 19.12
- INCLUDE: Evaluating Multilingual Language Understanding With Regional Knowledge (2024). Angelika Romanou, Negar Foroutan, Anna Sotnikova, et al. 19.08
- Mllm-as-a-judge: Assessing Multimodal Llm-as-a-judge With Vision-language Benchmark (2024). Dongping Chen, Ruoxi Chen, Shilin Zhang, et al. 19.04
- Making Llama SEE And Draw With SEED Tokenizer (2023). Yuying Ge, Sijie Zhao, Ziyun Zeng, et al. 18.81
- Codi-2: In-context, Interleaved, And Interactive Any-to-any Generation (2023). Zineng Tang, Ziyi Yang, Mahmoud Khademi, et al. 18.73
- Longagent: Scaling Language Models To 128k Context Through Multi-agent Collaboration (2024). Jun Zhao, Can Zu, Hao Xu, et al. 18.61
- From GPT-4 To Gemini And Beyond: Assessing The Landscape Of Mllms On Generalizability, Trustworthiness And Causality Through Four Modalities (2024). Chaochao Lu, Chen Qian, Guodong Zheng, et al. 18.43
- Navgpt-2: Unleashing Navigational Reasoning Capability For Large Vision-language Models (2024). Gengze Zhou, Yicong Hong, Zun Wang, et al. 18.43
- How Easy Is It To Fool Your Multimodal Llms? An Empirical Analysis On Deceptive Prompts (2024). Yusu Qian, Haotian Zhang, Yinfei Yang, et al. 18.29
- Worldgpt: Empowering LLM As Multimodal World Model (2024). Zhiqi Ge, Hongzhe Huang, Mingze Zhou, et al. 18.08
- Draw-and-understand: Leveraging Visual Prompts To Enable Mllms To Comprehend What You Want (2024). Weifeng Lin, Xinyu Wei, Ruichuan An, et al. 18.04
- Unleashing The Potential Of Prompt Engineering For Large Language Models (2023). Banghao Chen, Zhaofeng Zhang, Nicolas Langrené, et al. 17.87
- Pink: Unveiling The Power Of Referential Comprehension For Multi-modal Llms (2023). Shiyu Xuan, Qingpei Guo, Ming Yang, et al. 17.64
- Mmevalpro: Calibrating Multimodal Benchmarks Towards Trustworthy And Efficient Evaluation (2024). Jinsheng Huang, Liang Chen, Taian Guo, et al. 17.18
- Mm-embed: Universal Multimodal Retrieval With Multimodal Llms (2024). Sheng-Chieh Lin, Chankyu Lee, Mohammad Shoeybi, et al. 17.06
- Luminate: Structured Generation And Exploration Of Design Space With Large Language Models For Human-ai Co-creation (2023). Sangho Suh, Meng Chen, Bryan Min, et al. 16.90
- Llms Meet Multimodal Generation And Editing: A Survey (2024). Yingqing He, Zhaoyang Liu, Jingye Chen, et al. 16.87
- Llms Are Few-shot In-context Low-resource Language Learners (2024). Samuel Cahyawijaya, Holy Lovenia, Pascale Fung. 16.84