Multimodal Reasoning Agent For Zero-shot Composed Image Retrieval
2025 · Rong-Cheng Tu, Wenhao Sun, Hanzhe You, et al.
Abstract
Zero-Shot Composed Image Retrieval (ZS-CIR) aims to retrieve target images given a compositional query, consisting of a reference image and a modifying text, without relying on annotated training data. Existing approaches often generate a synthetic target text using large language models (LLMs) to serve as an intermediate anchor between the compositional query and the target image. Models are then trained to align the compositional query with the generated text, and separately to align images with their corresponding texts using contrastive learning. However, this reliance on intermediate text introduces error propagation: inaccuracies in the query-to-text and text-to-image mappings accumulate, ultimately degrading retrieval performance. To address these problems, we propose a novel framework that employs a Multimodal Reasoning Agent (MRA) for ZS-CIR. MRA eliminates the dependence on textual intermediaries by directly constructing triplets, <reference image, modification text, target image>
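The contrastive alignment mentioned in the abstract can be illustrated with a generic InfoNCE-style loss, in which each query embedding is pulled toward its paired target embedding and pushed away from the other targets in the batch. This is only a minimal sketch under stated assumptions: the function names, the temperature value, and the toy 2-D embeddings are hypothetical, and the text shown here does not say which loss or encoders the authors actually use.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce(query_embs, target_embs, temperature=0.07):
    """Mean InfoNCE loss over a batch.

    query_embs[i] is assumed to be paired with target_embs[i]; every other
    target in the batch acts as a negative. The temperature is an assumed
    illustrative value, not one taken from the paper.
    """
    queries = [l2_normalize(v) for v in query_embs]
    targets = [l2_normalize(v) for v in target_embs]
    total = 0.0
    for i, q in enumerate(queries):
        # Cosine similarity of query i to every target, scaled by temperature.
        logits = [sum(a * b for a, b in zip(q, t)) / temperature for t in targets]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
        # Negative log-probability assigned to the correct (paired) target.
        total += log_denom - logits[i]
    return total / len(queries)


# Toy embeddings standing in for composed queries (reference image plus
# modification text) and for target images.
composed_queries = [[1.0, 0.0], [0.0, 1.0]]
matched_targets = [[1.0, 0.0], [0.0, 1.0]]
swapped_targets = [[0.0, 1.0], [1.0, 0.0]]

loss_matched = info_nce(composed_queries, matched_targets)
loss_swapped = info_nce(composed_queries, swapped_targets)
```

With correctly paired embeddings the loss is close to zero, while swapping the targets makes it large, which is the signal a contrastive objective uses to bring paired items together in the embedding space.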
Related papers
- From Mapping To Composing: A Two-stage Framework For Zero-shot Composed Image Retrieval (2025) 0.00
- CoTMR: Chain-of-thought Multi-scale Reasoning For Training-free Zero-shot Composed Image Retrieval (2025) 0.00
- Reason-before-retrieve: One-stage Reflective Chain-of-thoughts For Training-free Zero-shot Composed Image Retrieval (2024) 10.03
- Modality And Task Adaptation For Enhanced Zero-shot Composed Image Retrieval (2024) 0.00
- MCoT-RE: Multi-faceted Chain-of-thought And Re-ranking For Training-free Zero-shot Composed Image Retrieval (2025) 0.00
- Training-free Zero-shot Composed Image Retrieval With Local Concept Reranking (2023) 0.00
- Fine-grained Zero-shot Composed Image Retrieval With Complementary Visual-semantic Integration (2026) 1.24
- Training-free Zero-shot Composed Image Retrieval Via Weighted Modality Fusion And Similarity (2024) 5.84