Multi-modal Reference Learning For Fine-grained Text-to-image Retrieval
2025 · Zehong Ma, Hao Chen, Wei Zeng, et al.
Abstract
Fine-grained text-to-image retrieval aims to retrieve a fine-grained target image with a given text query. Existing methods typically assume that each training image is accurately depicted by its textual descriptions. However, textual descriptions can be ambiguous and fail to depict discriminative visual details in images, leading to inaccurate representation learning. To alleviate the effects of text ambiguity, we propose a Multi-Modal Reference learning framework to learn robust representations. We first propose a multi-modal reference construction module to aggregate all visual and textual details of the same object into a comprehensive multi-modal reference. The multi-modal reference hence facilitates the subsequent representation learning and retrieval similarity computation. Specifically, a reference-guided representation learning module is proposed to use multi-modal references to learn more accurate visual and textual representations. Additionally, we introduce a reference-base
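The idea of aggregating every visual and textual detail of one object into a single reference can be illustrated with a minimal sketch. It assumes embeddings are already available as plain lists of floats, and it uses mean pooling and a blended cosine score purely as stand-ins; the abstract does not specify the aggregation or scoring the authors use. The names `build_reference`, `score`, and `alpha` are hypothetical.

```python
import math


def mean_pool(vectors):
    """Average a non-empty list of equal-length vectors element-wise."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def l2_normalize(v):
    """Scale a vector to unit length; return a copy unchanged if it is all zeros."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def build_reference(image_embs, text_embs):
    """Assumption: mean-pool all image and text embeddings of one object
    into a single unit-length multi-modal reference vector."""
    return l2_normalize(mean_pool(list(image_embs) + list(text_embs)))


def score(query_emb, image_emb, reference, alpha=0.5):
    """Assumption: blend query-to-image and query-to-reference similarity.
    alpha is a hypothetical mixing weight, not a value from the paper."""
    direct = cosine(query_emb, image_emb)
    via_reference = cosine(query_emb, reference)
    return (1 - alpha) * direct + alpha * via_reference


if __name__ == "__main__":
    # Toy 2-D embeddings for one object: one image view and one caption.
    reference = build_reference([[1.0, 0.0]], [[0.0, 1.0]])
    print(score([1.0, 0.0], [1.0, 0.0], reference))
```

In this sketch an ambiguous caption still contributes to the reference, but its influence is diluted by the other images and captions of the same object, which is the intuition the abstract describes.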
Related papers
- Fast-then-fine: A Two-stage Framework With Multi-granular Representation For Cross-modal Retrieval In Remote Sensing (2026) 0.00
- Category-oriented Representation Learning For Image To Multi-modal Retrieval (2023) 0.00
- Multi-modal Reasoning Graph For Scene-text Based Fine-grained Image Classification And Retrieval (2020) 11.29
- Composed Image Retrieval With Text Feedback Via Multi-grained Uncertainty Regularization (2022) 0.00
- Multi-path Exploration And Feedback Adjustment For Text-to-image Person Retrieval (2024) 0.00
- Mr. Right: Multimodal Retrieval On Representation Of Image With Text (2022) 0.00
- Bi-directional Training For Composed Image Retrieval Via Text Prompt Learning (2023) 15.63
- Look, Imagine And Match: Improving Textual-visual Cross-modal Retrieval With Generative Models (2017) 18.52