XR: Cross-modal Agents For Composed Image Retrieval
2026 · Zhongyu Yang, Wei Pang, Yingfang Yuan
Abstract
Retrieval is being redefined by agentic AI, demanding multimodal reasoning beyond conventional similarity-based paradigms. Composed Image Retrieval (CIR) exemplifies this shift as each query combines a reference image with textual modifications, requiring compositional understanding across modalities. While embedding-based CIR methods have achieved progress, they remain narrow in perspective, capturing limited cross-modal cues and lacking semantic reasoning. To address these limitations, we introduce XR, a training-free multi-agent framework that reframes retrieval as a progressively coordinated reasoning process. It orchestrates three specialized types of agents: imagination agents synthesize target representations through cross-modal generation, similarity agents perform coarse filtering via hybrid matching, and question agents verify factual consistency through targeted reasoning for fine filtering. Through progressive multi-agent coordination, XR iteratively refines retrieval to me
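The coarse-to-fine flow described in the abstract can be illustrated with a minimal, single-pass sketch. Everything here is an assumption made for illustration, not the authors' implementation: the `Candidate` type, the helper names, the weight `alpha`, the toy embeddings, and the keyword checks that stand in for question agents are all hypothetical. The imagination stage is represented only by its assumed outputs (`imagined_text`, `imagined_vec`), since a real system would produce them with generative models.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class Candidate:
    name: str
    embedding: list[float]  # toy stand-in for an image embedding
    caption: str            # toy stand-in for a textual description


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def jaccard(a, b):
    """Word-overlap similarity between two whitespace-tokenized strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def coarse_filter(imagined_vec, imagined_text, candidates, k, alpha=0.8):
    """Similarity stage (hypothetical): hybrid visual + textual score, keep top k."""
    scored = [
        (alpha * cosine(imagined_vec, c.embedding)
         + (1 - alpha) * jaccard(imagined_text, c.caption), c)
        for c in candidates
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:k]]


def fine_filter(shortlist, checks):
    """Question stage (hypothetical): rerank by how many yes/no checks pass.

    sorted() is stable, so ties keep their coarse-stage order.
    """
    return sorted(shortlist, key=lambda c: -sum(check(c) for check in checks))


# Toy gallery. Query: reference "blue dress long sleeves",
# modification "make it red with short sleeves".
gallery = [
    Candidate("A", [1.0, 0.0], "blue dress long sleeves"),
    Candidate("B", [0.9, 0.1], "red dress long sleeves"),
    Candidate("C", [0.0, 1.0], "red car"),
    Candidate("D", [0.5, 0.5], "red dress short sleeves"),
]

# Assumed outputs of the imagination stage (fixed by hand here).
imagined_text = "red dress short sleeves"
imagined_vec = [0.85, 0.15]

# Keyword checks standing in for targeted questions about each candidate.
checks = [
    lambda c: "red" in c.caption.split(),
    lambda c: "short" in c.caption.split(),
]

shortlist = coarse_filter(imagined_vec, imagined_text, gallery, k=3)
ranked = fine_filter(shortlist, checks)
print([c.name for c in ranked])  # ['D', 'B', 'A']
```

In this toy run the hybrid score ranks B first because its embedding is closest to the imagined one, and the verification step then promotes D, the only candidate that satisfies both requested modifications. The sketch makes one pass only; the abstract describes the refinement as iterative.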
Related papers
- Multimodal Reasoning Agent For Zero-shot Composed Image Retrieval (2025) 0.00
- InfoCIR: Multimedia Analysis For Composed Image Retrieval (2026) 1.24
- Composed Multi-modal Retrieval: A Survey Of Approaches And Applications (2025) 3.88
- MCoT-RE: Multi-faceted Chain-of-thought And Re-ranking For Training-free Zero-shot Composed Image Retrieval (2025) 0.00
- CIR-CoT: Towards Interpretable Composed Image Retrieval Via End-to-end Chain-of-thought Reasoning (2025) 0.00
- Reason-before-retrieve: One-stage Reflective Chain-of-thoughts For Training-free Zero-shot Composed Image Retrieval (2024) 10.03
- MCoT-MVS: Multi-level Vision Selection By Multi-modal Chain-of-thought Reasoning For Composed Image Retrieval (2026) 0.00
- CaLa: Complementary Association Learning For Augmenting Composed Image Retrieval (2024) 9.41