EDIS: Entity-driven Image Search Over Multimodal Web Content

Abstract

Making image retrieval methods practical for real-world search applications requires significant progress in dataset scales, entity comprehension, and multimodal information fusion. In this work, we introduce Entity-Driven Image Search (EDIS), a challenging dataset for cross-modal image search in the news domain. EDIS consists of 1 million web images from actual search engine results and curated datasets, with each image paired with a textual description. Unlike datasets that assume a small set of single-modality candidates, EDIS reflects real-world web image search scenarios by including a million multimodal image-text pairs as candidates. EDIS encourages the development of retrieval models that simultaneously address cross-modal information fusion and matching. To achieve accurate ranking results, a model must: 1) understand named entities and events from text queries, 2) ground entities onto images or text descriptions, and 3) effectively
