Mr. Right: Multimodal Retrieval On Representation Of Image With Text
2022 · Cheng-An Hsieh, Cheng-Ping Hsieh, Pu-Jen Cheng
Abstract
Multimodal learning is a recent challenge that extends unimodal learning by generalizing its domain to diverse modalities, such as texts, images, or speech. This extension requires models to process and relate information from multiple modalities. In Information Retrieval, traditional retrieval tasks focus on the similarity between unimodal documents and queries, while image-text retrieval hypothesizes that most texts contain the scene context from images. This separation has ignored that real-world queries may involve text content, image captions, or both. To address this, we introduce Multimodal Retrieval on Representation of ImaGe witH Text (Mr. Right), a novel and comprehensive dataset for multimodal retrieval. We utilize the Wikipedia dataset with rich text-image examples and generate three types of text-based queries with different modality information: text-related, image-related, and mixed. To validate the effectiveness of our dataset, we provide a multimodal training paradigm