IDMR: Towards Instance-driven Precise Visual Correspondence In Multimodal Retrieval
2025 Β· Bangwei Liu, Yicheng Bao, Shaohui Lin, et al.
Abstract
Multimodal retrieval systems are becoming increasingly vital for cutting-edge AI technologies, such as embodied AI and AI-driven digital content industries. However, current multimodal retrieval tasks lack sufficient complexity and demonstrate limited practical application value. It spires us to design Instance-Driven Multimodal Image Retrieval (IDMR), a novel task that requires models to retrieve images containing the same instance as a query image while matching a text-described scenario. Unlike existing retrieval tasks focused on global image similarity or category-level matching, IDMR demands fine-grained instance-level consistency across diverse contexts. To benchmark this capability, we develop IDMR-bench using real-world object tracking and first-person video data. Addressing the scarcity of training data, we propose a cross-domain synthesis method that creates 557K training samples by cropping objects from standard detection datasets. Our Multimodal Large Language Model (MLLM)
Authors
(none)
Tags
Stats
Related papers
- Indexing Multimodal Language Models For Large-scale Image Retrieval (2026)0.00
- Beyond Global Similarity: Towards Fine-grained, Multi-condition Multimodal Retrieval (2026)2.20
- RETLLM: Training And Data-free Mllms For Multimodal Information Retrieval (2026)1.57
- Deepimagesearch: Benchmarking Multimodal Agents For Context-aware Image Retrieval In Visual Histories (2026)0.00
- MRMR: A Realistic And Expert-level Multidisciplinary Benchmark For Reasoning-intensive Multimodal Retrieval (2025)0.00
- Docmmir: A Framework For Document Multi-modal Information Retrieval (2025)3.46
- Mm-embed: Universal Multimodal Retrieval With Multimodal Llms (2024)0.00
- M3DR: Towards Universal Multilingual Multimodal Document Retrieval (2025)0.00