Referring Expression Instance Retrieval And A Strong End-to-end Baseline

Abstract

Using natural language to query visual information is a fundamental need in real-world applications. Text-Image Retrieval (TIR) retrieves a target image from a gallery based on an image-level description, while Referring Expression Comprehension (REC) localizes a target object within a given image using an instance-level description. However, real-world applications often present more complex demands. Users typically search a large gallery with an instance-level description and expect to receive both the relevant image and the location of the corresponding instance. In such scenarios, TIR struggles with fine-grained descriptions and object-level localization, while REC cannot efficiently search large galleries and lacks an effective ranking mechanism. In this paper, we introduce a new task called Referring Expression Instance Retrieval (REIR), which supports both instance-level retrieval and localization based on fine-grained referring expressions. First, we prop
