Category-oriented Representation Learning For Image To Multi-modal Retrieval
2023 · Zida Cheng, Chen Ju, Shuai Xiao, et al.
Abstract
The rise of multi-modal search requests from users has highlighted the importance of multi-modal retrieval (i.e., image-to-text or text-to-image retrieval), yet the more complex task of image-to-multi-modal retrieval, crucial for many industry applications, remains under-explored. To address this gap and promote further research, we introduce and define the concept of Image-to-Multi-Modal Retrieval (IMMR), a process designed to retrieve rich multi-modal (i.e., image and text) documents based on image queries. We focus on representation learning for IMMR and analyze three key challenges for it: 1) skewed data and noisy labels in real-world industrial data, 2) the information inequality between the image and text modalities of documents when learning representations, and 3) effective and efficient training in large-scale industrial contexts. To tackle these challenges, we propose a novel framework named organizing categories and learning by classification for retrieval (OCLEAR). It consists of th
Related papers
- DocMMIR: A Framework For Document Multi-modal Information Retrieval (2025)
- Composed Multi-modal Retrieval: A Survey Of Approaches And Applications (2025)
- RETLLM: Training And Data-free MLLMs For Multimodal Information Retrieval (2026)
- Multi-modal Reference Learning For Fine-grained Text-to-image Retrieval (2025)
- IDMR: Towards Instance-driven Precise Visual Correspondence In Multimodal Retrieval (2025)
- Mr. Right: Multimodal Retrieval On Representation Of Image With Text (2022)
- CommerceMM: Large-scale Commerce Multimodal Representation Learning With Omni Retrieval (2022)
- A Unified Optimal Transport Framework For Cross-modal Retrieval With Noisy Labels (2024)