Awesome Image-Text Retrieval
Image-Text Retrieval is one of the most active areas in Awesome Multimodal, with 1,729 papers in this collection, evaluated on datasets such as MS COCO, Flickr30k, and InfoSeek. A strong starting point is "Vision-DeepResearch: Incentivizing DeepResearch Capability in Multimodal Large Language Models".
Key papers
- Vision-DeepResearch: Incentivizing DeepResearch Capability in Multimodal Large Language Models (2026), Wenxuan Huang et al. [12.58]
- VidVec: Unlocking Video MLLM Embeddings for Video-Text Retrieval (2026), Issar Tzachor et al. [12.16]
- Vision-DeepResearch Benchmark: Rethinking Visual and Textual Search for Multimodal Large Language Models (2026), Yu Zeng et al. [12.05]
- Everything in Its Place: Benchmarking Spatial Intelligence of Text-to-Image Models (2026), Zengbin Wang et al. [11.87]
- NextFlow: Unified Sequential Modeling Activates Multimodal Understanding and Generation (2026), Huichao Zhang et al. [10.66]
- Cross-Modal Retrieval for Motion and Text via DropTriple Loss (2023), Sheng Yan et al. [10.35]
- LLaVE: Large Language and Vision Embedding Models with Hardness-Weighted Contrastive Learning (2025), Zhibin Lan et al. [10.31]
- Text-Vision Co-Instructed Image Editing (2026), Chenxi Xie et al. [9.90]
- OmniVec2 -- A Novel Transformer Based Network For Large Scale Multimodal And Multitask Learning (2025), Siddharth Srivastava, Gaurav Sharma [9.46]
- Rethinking UMM Visual Generation: Masked Modeling for Efficient Image-Only Pre-training (2026), Peng Sun et al. [9.41]
- OpenVoxel: Training-Free Grouping and Captioning Voxels for Open-Vocabulary 3D Scene Understanding (2026), Sheng-Yu Huang et al. [8.81]
- ViTextVQA: A Large-Scale Visual Question Answering Dataset and a Novel Multimodal Feature Fusion Method for Vietnamese Text Comprehension in Images (2024), Quan Van Nguyen et al. [8.37]
- RANKVIDEO: Reasoning Reranking for Text-to-Video Retrieval (2026), Tyler Skow et al. [8.11]
- Vision-language Modeling Meets Remote Sensing: Models, Datasets And Perspectives (2025), Xingxing Weng, Chao Pang, Gui-Song Xia [7.99]
- Chat-driven Text Generation And Interaction For Person Retrieval (2025), Zequn Xie, Chuxin Wang, Sihang Cai, et al. [7.72]
- A Context-aware Attention And Graph Neural Network-based Multimodal Framework For Misogyny Detection (2025), Mohammad Zia Ur Rehman, Sufyaan Zahoor, Areeb Manzoor, et al. [7.57]
- SemIRNet: A Semantic Irony Recognition Network For Multimodal Sarcasm Detection (2025), Jingxuan Zhou, Yuehao Wu, Yibo Zhang, et al. [7.41]
- Rethinking Composed Image Retrieval Evaluation: A Fine-Grained Benchmark from Image Editing (2026), Tingyu Song et al. [7.40]
- MIRA: A Novel Framework For Fusing Modalities In Medical RAG (2025), Jinhong Wang, Tajamul Ashraf, Zongyan Han, et al. [7.39]
- Towards Vision-Language Geo-Foundation Model: A Survey (2024), Yue Zhou et al. [7.00]
- Human-centered Interactive Learning Via MLLMs For Text-to-image Person Re-identification (2025), Yang Qin, Chao Chen, Zhihang Fu, et al. [6.86]
- CLIP-GS: Unifying Vision-Language Representation with 3D Gaussian Splatting (2024), Siyu Jiao et al. [6.84]
- Multi-Modal Cross-Domain Alignment Network for Video Moment Retrieval (2022), Xiang Fang et al. [6.77]
- Robust Multimodal Sentiment Analysis Of Image-text Pairs By Distribution-based Feature Recovery And Fusion (2025), Daiqing Wu, Dongbao Yang, Yu Zhou, et al. [6.64]
- Patho-AgenticRAG: Towards Multimodal Agentic Retrieval-Augmented Generation for Pathology VLMs via Reinforcement Learning (2025), Wenchuan Zhang et al. [6.51]
- Reasoning in the Dark: Interleaved Vision-Text Reasoning in Latent Space (2025), Chao Chen et al. [6.34]
- MAGE: Multimodal Alignment And Generation Enhancement Via Bridging Visual And Semantic Spaces (2025), Shaojun E, Yuchen Yang, Jiaheng Wu, et al. [6.13]
- Prototypical Contrastive Learning-based CLIP Fine-tuning for Object Re-identification (2023), Jiachen Li and Xiaojin Gong [6.06]
- SGDFuse: SAM-guided Diffusion Model For High-fidelity Infrared And Visible Image Fusion (2025), Xiaoyang Zhang, Jinjiang Li, Guodong Fan, et al. [6.00]
- Re-Align: Structured Reasoning-guided Alignment for In-Context Image Generation and Editing (2026), Runze He et al. [5.91]
- Render-of-Thought: Rendering Textual Chain-of-Thought As Images For Visual Latent Reasoning (2026), Yifan Wang, Shiyu Li, Peiming Li, et al. [5.91]
- Intra-Modal Neighbors Never Lie: Rectifying Inter-Modal Noisy Correspondence via Graph-Based Intra-Modal Reasoning (2026), Yang Liu et al. [5.88]
- MobileCLIP2: Improving Multi-modal Reinforced Training (2025), Fartash Faghri, Pavan Kumar Anasosalu Vasu, Cem Koc, et al. [5.83]
- Jina-embeddings-v4: Universal Embeddings For Multimodal Multilingual Retrieval (2025), Michael Günther, Saba Sturua, Mohammad Kalim Akram, et al. [5.82]
- DiscoVLA: Discrepancy Reduction In Vision, Language, And Alignment For Parameter-efficient Video-text Retrieval (2025), Leqi Shen, Guoqiang Gong, Tianxiang Hao, et al. [5.79]
- DRKF: Decoupled Representations With Knowledge Fusion For Multimodal Emotion Recognition (2025), Peiyuan Jiang, Yao Liu, Qiao Liu, et al. [5.67]
- DiffRIS: Enhancing Referring Remote Sensing Image Segmentation With Pre-trained Text-to-image Diffusion Models (2025), Zhe Dong, Yuzhe Sun, Tianzhu Liu, et al. [5.46]
- Efficient Medical Vision-language Alignment Through Adapting Masked Vision Models (2025), Chenyu Lian, Hong-Yu Zhou, Dongyun Liang, et al. [5.39]
- CONQUER: Context-aware Representation With Query Enhancement For Text-based Person Search (2026), Zequn Xie [5.36]
- Mitigating the Modality Gap: Few-Shot Out-of-Distribution Detection with Multi-modal Prototypes and Image Bias Estimation (2025), Yimu Wang et al. [5.24]
- SimpleDoc: Multi-modal Document Understanding With Dual-cue Page Retrieval And Iterative Refinement (2025), Chelsi Jain, Yiran Wu, Yifan Zeng, et al. [5.23]
- Chrono: A Simple Blueprint for Representing Time in MLLMs (2024), Hector Rodriguez et al. [5.18]
- VIOLA: Towards Video In-Context Learning with Minimal Annotations (2026), Ryo Fujii et al. [5.18]
- ArtiScene: Language-driven Artistic 3D Scene Generation Through Image Intermediary (2025), Zeqi Gu, Yin Cui, Zhaoshuo Li, et al. [5.04]
- OpenEvents V1: Large-scale Benchmark Dataset For Multimodal Event Grounding (2025), Hieu Nguyen, Phuc-Tan Nguyen, Thien-Phuc Tran, et al. [5.04]
- Empowering Morphing Attack Detection Using Interpretable Image-text Foundation Model (2025), Sushrut Patwardhan, Raghavendra Ramachandra, Sushma Venkatesh [5.04]
- Agentic AI With Orchestrator-agent Trust: A Modular Visual Classification Framework With Trust-aware Orchestration And RAG-based Reasoning (2025), Konstantinos I. Roumeliotis, Ranjan Sapkota, Manoj Karkee, et al. [5.04]
- Reasoning Text-to-Video Retrieval for Operating Room Clips via Action-Driven Digital Twins (2026), Yiqing Shen et al. [5.01]
- MoLoRAG: Bootstrapping Document Understanding Via Multi-modal Logic-aware Retrieval (2025), Xixi Wu, Yanchao Tan, Nan Hou, et al. [4.88]
- RICO: Improving Accuracy And Completeness In Image Recaptioning Via Visual Reconstruction (2025), Yuchi Wang, Yishuo Cai, Shuhuai Ren, et al. [4.83]
- Exploring Cross-Modal Flows for Few-Shot Learning (2025), Ziqi Jiang et al. [4.75]
- Learning Contrastive Multimodal Fusion With Improved Modality Dropout For Disease Detection And Prediction (2025), Yi Gu, Kuniaki Saito, Jiaxin Ma [4.72]
- CLIMP: Contrastive Language-Image Mamba Pretraining (2026), Nimrod Shabtay et al. [4.70]
- Multimodal Referring Segmentation: A Survey (2025), Henghui Ding, Song Tang, Shuting He, et al. [4.70]
- CLIP-HandID: Vision-language Model For Hand-based Person Identification (2025), Nathanael L. Baisa, Babu Pallam, Amudhavel Jayavel [4.53]
- Scale, Don't Fine-tune: Guiding Multimodal LLMs For Efficient Visual Place Recognition At Test-time (2025), Jintao Cheng, Weibin Li, Jiehao Luo, et al. [4.53]
- Representation Discrepancy Bridging Method For Remote Sensing Image-text Retrieval (2025), Hailong Ning, Siying Wang, Tao Lei, et al. [4.53]
- OpenSplat3D: Open-vocabulary 3D Instance Segmentation Using Gaussian Splatting (2025), Jens Piekenbrinck, Christian Schmidt, Alexander Hermans, et al. [4.53]
- Modality-aware Infrared And Visible Image Fusion With Target-aware Supervision (2025), Tianyao Sun, Dawei Xiang, Tianqi Ding, et al. [4.53]
- CoRe-MMRAG: Cross-source Knowledge Reconciliation For Multimodal RAG (2025), Yang Tian, Fan Liu, Jingyuan Zhang, et al. [4.53]