Telling The What While Pointing To The Where: Multimodal Queries For Image Retrieval
2021 · Soravit Changpinyo, Jordi Pont-Tuset, Vittorio Ferrari, et al.
Abstract
Most existing image retrieval systems use text queries as a way for the user to express what they are looking for. However, fine-grained image retrieval often also requires the ability to express where in the image the desired content is located. The text modality can only cumbersomely express such localization preferences, whereas pointing is a more natural fit. In this paper, we propose an image retrieval setup with a new form of multimodal queries, where the user simultaneously uses both spoken natural language (the what) and mouse traces over an empty canvas (the where) to express the characteristics of the desired target image. We then describe simple modifications to an existing image retrieval model, enabling it to operate in this setup. Qualitative and quantitative experiments show that our model effectively takes this spatial guidance into account and provides significantly more accurate retrieval results than equivalent text-only systems.
Related papers
- Embedding Arithmetic Of Multimodal Queries For Image Retrieval (2021)
- IDMR: Towards Instance-driven Precise Visual Correspondence In Multimodal Retrieval (2025)
- You'll Never Walk Alone: A Sketch And Text Duet For Fine-grained Image Retrieval (2024)
- Recqr: Incorporating Conversational Query Rewriting To Improve Multimodal Image Retrieval (2026)
- End-to-end Knowledge Retrieval With Multi-modal Queries (2023)
- Ask&confirm: Active Detail Enriching For Cross-modal Retrieval With Partial Query (2021)
- Indexing Multimodal Language Models For Large-scale Image Retrieval (2026)
- Mr. Right: Multimodal Retrieval On Representation Of Image With Text (2022)