Beyond Global Similarity: Towards Fine-grained, Multi-condition Multimodal Retrieval
2026 Β· Xuan Lu, Kangle Li, Haohang Huang, et al.
Abstract
Recent advances in multimodal large language models (MLLMs) have substantially expanded the capabilities of multimodal retrieval, enabling systems to align and retrieve information across visual and textual modalities. Yet, existing benchmarks largely focus on coarse-grained or single-condition alignment, overlooking real-world scenarios where user queries specify multiple interdependent constraints across modalities. To bridge this gap, we introduce MCMR (Multi-Conditional Multimodal Retrieval): a large-scale benchmark designed to evaluate fine-grained, multi-condition cross-modal retrieval under natural-language queries. MCMR spans five product domains: upper and bottom clothing, jewelry, shoes, and furniture. It also preserves rich long-form metadata essential for compositional matching. Each query integrates complementary visual and textual attributes, requiring models to jointly satisfy all specified conditions for relevance. We benchmark a diverse suite of MLLM-based multimodal r
Authors
(none)
Tags
Stats
Related papers
- Indexing Multimodal Language Models For Large-scale Image Retrieval (2026)0.00
- Mm-embed: Universal Multimodal Retrieval With Multimodal Llms (2024)0.00
- IDMR: Towards Instance-driven Precise Visual Correspondence In Multimodal Retrieval (2025)2.29
- GME: Improving Universal Multimodal Retrieval By Multimodal Llms (2024)0.00
- RETLLM: Training And Data-free Mllms For Multimodal Information Retrieval (2026)1.57
- MRMR: A Realistic And Expert-level Multidisciplinary Benchmark For Reasoning-intensive Multimodal Retrieval (2025)0.00
- CREM: Compression-driven Representation Enhancement For Multimodal Retrieval And Comprehension (2026)0.00
- MM-BRIGHT: A Multi-task Multimodal Benchmark For Reasoning-intensive Retrieval (2026)2.60