Referring Expression Instance Retrieval And A Strong End-to-end Baseline
2025 · Xiangzhao Hao, Kuan Zhu, Hongyu Guo, et al.
Abstract
Using natural language to query visual information is a fundamental need in real-world applications. Text-Image Retrieval (TIR) retrieves a target image from a gallery based on an image-level description, while Referring Expression Comprehension (REC) localizes a target object within a given image using an instance-level description. However, real-world applications often present more complex demands. Users typically query a large gallery with an instance-level description and expect to receive both the relevant image and the corresponding instance location. In such scenarios, TIR struggles with fine-grained descriptions and object-level localization, while REC cannot efficiently search large galleries and lacks an effective ranking mechanism. In this paper, we introduce a new task called Referring Expression Instance Retrieval (REIR), which supports both instance-level retrieval and localization based on fine-grained referring expressions. First, we prop
Related papers
- Resedis: A Dataset For Referring-based Object Search Across Large-scale Image Collections (2025) · 0.00
- Instance-level Image Retrieval Using Reranking Transformers (2021) · 19.00
- Intrec: Intent-based Retrieval With Contrastive Refinement (2026) · 0.00
- Composed Object Retrieval: Object-level Retrieval Via Composed Expressions (2025) · 1.91
- Instruct-reid++: Towards Universal Purpose Instruction-guided Person Re-identification (2024) · 9.13
- SORCE: Small Object Retrieval In Complex Environments (2025) · 0.00
- Multi-modal Reference Learning For Fine-grained Text-to-image Retrieval (2025) · 6.77
- IDMR: Towards Instance-driven Precise Visual Correspondence In Multimodal Retrieval (2025) · 2.29