Awesome Visual Language
Visual Language is one of the most active areas in Awesome Computer Vision, with 748 papers in this collection evaluated on datasets such as COCO, LVIS, and RefCOCO. A strong starting point is "Meshed-memory Transformer For Image Captioning".
Key papers
- Meshed-memory Transformer For Image Captioning (2019) Marcella Cornia, Matteo Stefanini, Lorenzo Baraldi, et al. 27.73
- A Survey On Visual Transformer (2020) Kai Han, Yunhe Wang, Hanting Chen, et al. 26.80
- Unified Vision-language Pre-training For Image Captioning And VQA (2019) Luowei Zhou, Hamid Palangi, Lei Zhang, et al. 25.70
- Visual Semantic Reasoning For Image-text Matching (2019) Kunpeng Li, Yulun Zhang, Kai Li, et al. 25.23
- Grounding DINO: Marrying DINO With Grounded Pre-training For Open-set Object Detection (2023) Shilong Liu, Zhaoyang Zeng, Tianhe Ren, et al. 23.46
- Dual-level Collaborative Transformer For Image Captioning (2021) Yunpeng Luo, Jiayi Ji, Xiaoshuai Sun, et al. 22.95
- VinVL: Revisiting Visual Representations In Vision-language Models (2021) Pengchuan Zhang, Xiujun Li, Xiaowei Hu, et al. 21.40
- VadCLIP: Adapting Vision-language Models For Weakly Supervised Video Anomaly Detection (2023) Peng Wu, Xuerong Zhou, Guansong Pang, et al. 20.92
- YOLO-World: Real-time Open-vocabulary Object Detection (2024) Tianheng Cheng, Lin Song, Yixiao Ge, et al. 20.53
- RegionCLIP: Region-based Language-image Pretraining (2021) Yiwu Zhong, Jianwei Yang, Pengchuan Zhang, et al. 20.18
- Multi-branch And Multi-scale Attention Learning For Fine-grained Visual Categorization (2020) Fan Zhang, Meng Li, Guisheng Zhai, et al. 20.02
- Image Segmentation Using Text And Image Prompts (2021) Timo Lüddecke, Alexander S. Ecker 19.96
- Open-vocabulary Semantic Segmentation With Mask-adapted CLIP (2022) Feng Liang, Bichen Wu, Xiaoliang Dai, et al. 19.49
- Open-vocabulary Object Detection Using Captions (2020) Alireza Zareian, Kevin Dela Rosa, Derek Hao Hu, et al. 19.34
- LAVT: Language-aware Vision Transformer For Referring Image Segmentation (2021) Zhao Yang, Jiaqi Wang, Yansong Tang, et al. 18.95
- Neural Baby Talk (2018) Jiasen Lu, Jianwei Yang, Dhruv Batra, et al. 18.79
- Tip-Adapter: Training-free Adaption Of CLIP For Few-shot Classification (2022) Renrui Zhang, Zhang Wei, Rongyao Fang, et al. 18.65
- T-Rex2: Towards Generic Object Detection Via Text-visual Prompt Synergy (2024) Qing Jiang, Feng Li, Zhaoyang Zeng, et al. 18.52
- CricaVPR: Cross-image Correlation-aware Representation Learning For Visual Place Recognition (2024) Feng Lu, Xiangyuan Lan, Lijun Zhang, et al. 18.09
- CLIP-ReID: Exploiting Vision-language Model For Image Re-identification Without Concrete Text Labels (2022) Siyuan Li, Li Sun, Qingli Li 17.74
- SLIP: Self-supervision Meets Language-image Pre-training (2021) Norman Mu, Alexander Kirillov, David Wagner, et al. 17.34
- Composed Image Retrieval Using Contrastive Learning And Task-oriented CLIP-based Features (2023) Alberto Baldrati, Marco Bertini, Tiberio Uricchio, et al. 16.84
- Cross Language Image Matching For Weakly Supervised Semantic Segmentation (2022) Jinheng Xie, Xianxu Hou, Kai Ye, et al. 16.71
- Seeing Out Of The Box: End-to-end Pre-training For Vision-language Representation Learning (2021) Zhicheng Huang, Zhaoyang Zeng, Yupan Huang, et al. 16.67
- Scaling Up Vision-language Pre-training For Image Captioning (2021) Xiaowei Hu, Zhe Gan, Jianfeng Wang, et al. 16.49
- Frozen CLIP Models Are Efficient Video Learners (2022) Ziyi Lin, Shijie Geng, Renrui Zhang, et al. 16.47
- MaskCLIP: Masked Self-distillation Advances Contrastive Language-image Pretraining (2022) Xiaoyi Dong, Jianmin Bao, Yinglin Zheng, et al. 16.25
- SparseTT: Visual Tracking With Sparse Transformers (2022) Zhihong Fu, Zehua Fu, Qingjie Liu, et al. 16.19
- Mitigating Object Hallucinations In Large Vision-language Models Through Visual Contrastive Decoding (2023) Sicong Leng, Hang Zhang, Guanzheng Chen, et al. 16.19
- Scene Text Retrieval Via Joint Text Detection And Similarity Learning (2021) Hao Wang, Xiang Bai, Mingkun Yang, et al. 16.16
- Aligning And Prompting Everything All At Once For Universal Visual Perception (2023) Yunhang Shen, Chaoyou Fu, Peixian Chen, et al. 15.92
- Polysemy Deciphering Network For Robust Human-object Interaction Detection (2020) Xubin Zhong, Changxing Ding, Xian Qu, et al. 15.64
- CLIP-DINOiser: Teaching CLIP A Few DINO Tricks For Open-vocabulary Semantic Segmentation (2023) Monika Wysoczańska, Oriane Siméoni, Michaël Ramamonjisoa, et al. 15.63
- A Survey Of Vision-language Pre-trained Models (2022) Yifan Du, Zikang Liu, Junyi Li, et al. 15.62
- MDMMT: Multidomain Multimodal Transformer For Video Retrieval (2021) Maksim Dzabraev, Maksim Kalashnikov, Stepan Komkov, et al. 15.51
- When Do We Not Need Larger Vision Models? (2024) Baifeng Shi, Ziyang Wu, Maolin Mao, et al. 15.46
- Injecting Semantic Concepts Into End-to-end Image Captioning (2021) Zhiyuan Fang, Jianfeng Wang, Xiaowei Hu, et al. 15.43
- Delta Descriptors: Change-based Place Representation For Robust Visual Localization (2020) Sourav Garg, Ben Harwood, Gaurangi Anand, et al. 15.26
- GRIT: Faster And Better Image Captioning Transformer Using Dual Visual Features (2022) Van-Quang Nguyen, Masanori Suganuma, Takayuki Okatani 15.19
- Thinking Fast And Slow: Efficient Text-to-visual Retrieval With Transformers (2021) Antoine Miech, Jean-Baptiste Alayrac, Ivan Laptev, et al. 15.16
- Describing And Localizing Multiple Changes With Transformers (2021) Yue Qiu, Shintaro Yamamoto, Kodai Nakashima, et al. 15.10
- Learning Reinforced Attentional Representation For End-to-end Visual Tracking (2019) Peng Gao, Qiquan Zhang, Fei Wang, et al. 14.83
- Cap2Det: Learning To Amplify Weak Caption Supervision For Object Detection (2019) Keren Ye, Mingda Zhang, Adriana Kovashka, et al. 14.72
- PhraseCut: Language-based Image Segmentation In The Wild (2020) Chenyun Wu, Zhe Lin, Scott Cohen, et al. 14.66
- Local-global Context Aware Transformer For Language-guided Video Segmentation (2022) Chen Liang, Wenguan Wang, Tianfei Zhou, et al. 14.62
- Exploiting Unlabeled Data With Vision And Language Models For Object Detection (2022) Shiyu Zhao, Zhixing Zhang, Samuel Schulter, et al. 14.47
- Image Captioning Through Image Transformer (2020) Sen He, Wentong Liao, Hamed R. Tavakoli, et al. 14.43
- InterleaveThinker: Reinforcing Agentic Interleaved Generation (2026) Dian Zheng et al. 14.38
- Revisiting Weakly Supervised Pre-training Of Visual Perception Models (2022) Mannat Singh, Laura Gustafson, Aaron Adcock, et al. 14.27
- Open-vocabulary Instance Segmentation Via Robust Cross-modal Pseudo-labeling (2021) Dat Huynh, Jason Kuen, Zhe Lin, et al. 14.15
- An Empirical Study Of CLIP For Text-based Person Search (2023) Min Cao, Yang Bai, Ziyin Zeng, et al. 14.11
- Fine-grained Image Captioning With Global-local Discriminative Objective (2020) Jie Wu, Tianshui Chen, Hefeng Wu, et al. 14.06
- CrowdCLIP: Unsupervised Crowd Counting Via Vision-language Model (2023) Dingkang Liang, Jiahao Xie, Zhikang Zou, et al. 13.97
- Align2Ground: Weakly Supervised Phrase Grounding Guided By Image-caption Alignment (2019) Samyak Datta, Karan Sikka, Anirban Roy, et al. 13.93
- Real-time Visual Object Tracking With Natural Language Description (2019) Qi Feng, Vitaly Ablavsky, Qinxun Bai, et al. 13.93
- GSVA: Generalized Segmentation Via Multimodal Large Language Models (2023) Zhuofan Xia, Dongchen Han, Yizeng Han, et al. 13.84
- CLIP-Count: Towards Text-guided Zero-shot Object Counting (2023) Ruixiang Jiang, Lingbo Liu, Changwen Chen 13.79
- CLIP Models Are Few-shot Learners: Empirical Studies On VQA And Visual Entailment (2022) Haoyu Song, Li Dong, Wei-Nan Zhang, et al. 13.79
- Building An Open-vocabulary Video CLIP Model With Better Architectures, Optimization And Data (2023) Zuxuan Wu, Zejia Weng, Wujian Peng, et al. 13.75
- CLIP-guided Prototype Modulating For Few-shot Action Recognition (2023) Xiang Wang, Shiwei Zhang, Jun Cen, et al. 13.65