Awesome Multimodal

Awesome Image-Text Retrieval — curated papers, datasets & benchmarks · Awesome Multimodal

Awesome Image-Text Retrieval

Image-Text Retrieval is one of the most active areas in Awesome Multimodal: this collection holds 1,998 papers, commonly evaluated on datasets such as MSCOCO (COCO) and Flickr30k. A strong starting point is "ViQ: Text-Aligned Visual Quantized Representations at Any Resolution".

Datasets & benchmarks

MSCOCO: 50 papers

Flickr30k: 43 papers

MSR-VTT: 19 papers

InfoSeek: 10 papers

Conceptual Captions: 10 papers

ImageNet-1k: 9 papers

MMEB-v-2: 8 papers

Key papers

60 papers, sorted by trending (default). The number under each entry is its 🔥 heat score.

ViQ: Text-Aligned Visual Quantized Representations at Any Resolution (2026)
Xumin Yu et al.
12.77
Vision-DeepResearch: Incentivizing DeepResearch Capability in Multimodal Large Language Models (2026)
Wenxuan Huang et al.
12.47
VidVec: Unlocking Video MLLM Embeddings for Video-Text Retrieval (2026)
Issar Tzachor et al.
12.05
Vision-DeepResearch Benchmark: Rethinking Visual and Textual Search for Multimodal Large Language Models (2026)
Yu Zeng et al.
11.94
Everything in Its Place: Benchmarking Spatial Intelligence of Text-to-Image Models (2026)
Zengbin Wang et al.
11.75
Oxygen-TryOn: Fashion-Native Foundation Model for Any-item Virtual Try-On (2026)
Yong Liu et al.
11.70
NextFlow: Unified Sequential Modeling Activates Multimodal Understanding and Generation (2026)
Huichao Zhang et al.
10.55
Cross-Modal Retrieval for Motion and Text via DropTriple Loss (2023)
Sheng Yan et al.
10.35
RANKVIDEO: Reasoning Reranking for Text-to-Video Retrieval (2026)
Tyler Skow et al.
10.25
LLaVE: Large Language and Vision Embedding Models with Hardness-Weighted Contrastive Learning (2025)
Zhibin Lan et al.
10.20
Scaling Native Multimodal Pre-Training From Scratch (2026)
Haoyuan Wu et al.
10.19
Text-Vision Co-Instructed Image Editing (2026)
Chenxi Xie et al.
9.79
IV-CoT: Implicit Visual Chain-of-Thought for Structure-Aware Text-to-Image Generation (2026)
Zixuan Li et al.
9.48
OmniVec2 - A Novel Transformer Based Network for Large Scale Multimodal and Multitask Learning (2025)
Siddharth Srivastava, Gaurav Sharma
9.34
Rethinking UMM Visual Generation: Masked Modeling for Efficient Image-Only Pre-training (2026)
Peng Sun et al.
9.30
OpenVoxel: Training-Free Grouping and Captioning Voxels for Open-Vocabulary 3D Scene Understanding (2026)
Sheng-Yu Huang et al.
8.70
CanvasAgent: Enabling Complex Image Creation and Editing via Visual Tool Orchestration (2026)
Hairui Zhu et al.
8.40
ViTextVQA: A Large-Scale Visual Question Answering Dataset and a Novel Multimodal Feature Fusion Method for Vietnamese Text Comprehension in Images (2024)
Quan Van Nguyen et al.
8.37
ProMSA: Progressive Multimodal Search Agents for Knowledge-Based Visual Question Answering (2026)
ZhengXian Wu et al.
8.34
Hierarchical Sparse Attention Done Right: Toward Infinite Context Modeling (2026)
Xiang Hu et al.
8.18
Are Text-to-Image Models Inductivist Turkeys? A Counterfactual Benchmark for Causal Reasoning (2026)
Jiayi Lei et al.
8.12
Vision-Language Modeling Meets Remote Sensing: Models, Datasets and Perspectives (2025)
Xingxing Weng, Chao Pang, Gui-Song Xia
7.88
Unlocking the Visual Record of Materials Science: A Large-Scale Multimodal Dataset from Scientific Literature (2026)
Subham Ghosh et al.
7.86
SGDFuse: SAM-Guided Diffusion Model for High-Fidelity Infrared and Visible Image Fusion (2025)
Xiaoyang Zhang, Jinjiang Li, Guodong Fan, et al.
7.76
Chat-Driven Text Generation and Interaction for Person Retrieval (2025)
Zequn Xie, Chuxin Wang, Sihang Cai, et al.
7.75
DiscoVLA: Discrepancy Reduction in Vision, Language, and Alignment for Parameter-Efficient Video-Text Retrieval (2025)
Leqi Shen, Guoqiang Gong, Tianxiang Hao, et al.
7.63
jina-embeddings-v4: Universal Embeddings for Multimodal Multilingual Retrieval (2025)
Michael Günther, Saba Sturua, Mohammad Kalim Akram, et al.
7.61
Human-Centered Interactive Learning via MLLMs for Text-to-Image Person Re-identification (2025)
Yang Qin, Chao Chen, Zhihang Fu, et al.
7.61
A Context-Aware Attention and Graph Neural Network-Based Multimodal Framework for Misogyny Detection (2025)
Mohammad Zia Ur Rehman, Sufyaan Zahoor, Areeb Manzoor, et al.
7.61
SemIRNet: A Semantic Irony Recognition Network for Multimodal Sarcasm Detection (2025)
Jingxuan Zhou, Yuehao Wu, Yibo Zhang, et al.
7.46
CILP-FGDI: Exploiting Vision-Language Model for Generalizable Person Re-Identification (2025)
Huazhong Zhao et al.
7.33
Rethinking Composed Image Retrieval Evaluation: A Fine-Grained Benchmark from Image Editing (2026)
Tingyu Song et al.
7.28
MIRA: A Novel Framework for Fusing Modalities in Medical RAG (2025)
Jinhong Wang, Tajamul Ashraf, Zongyan Han, et al.
7.27
From Structure to Synergy: A Survey of Vision-Language Perception Paradigm Evolution in Multimodal Large Language Models (2026)
Haoxiang Sun et al.
7.16
Towards Vision-Language Geo-Foundation Model: A Survey (2024)
Yue Zhou et al.
7.00
Robust Multimodal Sentiment Analysis of Image-Text Pairs by Distribution-Based Feature Recovery and Fusion (2025)
Daiqing Wu, Dongbao Yang, Yu Zhou, et al.
6.95
CLIP-GS: Unifying Vision-Language Representation with 3D Gaussian Splatting (2024)
Siyu Jiao et al.
6.84
Multi-Modal Cross-Domain Alignment Network for Video Moment Retrieval (2022)
Xiang Fang et al.
6.77
MMRL: Multi-Modal Representation Learning for Vision-Language Models (2025)
Yuncheng Guo et al.
6.58
Patho-AgenticRAG: Towards Multimodal Agentic Retrieval-Augmented Generation for Pathology VLMs via Reinforcement Learning (2025)
Wenchuan Zhang et al.
6.39
INTENT: Invariance and Discrimination-aware Noise Mitigation for Robust Composed Image Retrieval (2026)
Zhiwei Chen et al.
6.29
HINT: Composed Image Retrieval with Dual-path Compositional Contextualized Network (2026)
Mingyu Zhang et al.
6.24
Reasoning in the Dark: Interleaved Vision-Text Reasoning in Latent Space (2025)
Chao Chen et al.
6.23
Ambiguity-Aware and High-Order Relation Learning for Multi-Grained Image-Text Matching (2025)
Junyu Chen et al.
6.07
Prototypical Contrastive Learning-based CLIP Fine-tuning for Object Re-identification (2023)
Jiachen Li and Xiaojin Gong
6.06
MAGE: Multimodal Alignment and Generation Enhancement via Bridging Visual and Semantic Spaces (2025)
Shaojun E, Yuchen Yang, Jiaheng Wu, et al.
6.02
IDEA: Inverted Text with Cooperative Deformable Aggregation for Multi-modal Object Re-Identification (2025)
Yuhao Wang, Yongfeng Lv, Pingping Zhang, Huchuan Lu
5.84
MobileCLIP2: Improving Multi-Modal Reinforced Training (2025)
Fartash Faghri, Pavan Kumar Anasosalu Vasu, Cem Koc, et al.
5.83
Re-Align: Structured Reasoning-guided Alignment for In-Context Image Generation and Editing (2026)
Runze He et al.
5.79
Intra-Modal Neighbors Never Lie: Rectifying Inter-Modal Noisy Correspondence via Graph-Based Intra-Modal Reasoning (2026)
Yang Liu et al.
5.76
SimpleSearch-VL: A Simple Recipe for Multimodal Agentic Deep Search (2026)
Ming Dai et al.
5.76
DRKF: Decoupled Representations with Knowledge Fusion for Multimodal Emotion Recognition (2025)
Peiyuan Jiang, Yao Liu, Qiao Liu, et al.
5.56
Recognition-Synergistic Scene Text Editing (2025)
Zhengyao Fang et al.
5.54
Render-of-Thought: Rendering Textual Chain-of-Thought as Images for Visual Latent Reasoning (2026)
Yifan Wang, Shiyu Li, Peiming Li, et al.
5.52
VL-CLIP: Enhancing Multimodal Recommendations via Visual Grounding and LLM-Augmented CLIP Embeddings (2025)
Ramin Giahi et al.
5.40
DiffRIS: Enhancing Referring Remote Sensing Image Segmentation with Pre-Trained Text-to-Image Diffusion Models (2025)
Zhe Dong, Yuzhe Sun, Tianzhu Liu, et al.
5.35
Efficient Medical Vision-Language Alignment Through Adapting Masked Vision Models (2025)
Chenyu Lian, Hong-Yu Zhou, Dongyun Liang, et al.
5.28
FineLIP: Extending CLIP's Reach via Fine-Grained Alignment with Longer Text Inputs (2025)
Mothilal Asokan et al.
5.24
Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking (2026)
Mingxin Li et al.
5.19
Chrono: A Simple Blueprint for Representing Time in MLLMs (2024)
Hector Rodriguez et al.
5.18