CIR-CoT: Towards Interpretable Composed Image Retrieval via End-to-End Chain-of-Thought Reasoning
2025 · Weihuang Lin, Yiwei Ma, Jiayi Ji, et al.
Abstract
Composed Image Retrieval (CIR), which aims to find a target image from a reference image and a modification text, presents the core challenge of performing unified reasoning across visual and semantic modalities. While current approaches based on Vision-Language Models (VLMs, e.g., CLIP) and more recent Multimodal Large Language Models (MLLMs, e.g., Qwen-VL) have shown progress, they predominantly function as "black boxes." This inherent opacity not only prevents users from understanding the retrieval rationale but also restricts the models' ability to follow complex, fine-grained instructions. To overcome these limitations, we introduce CIR-CoT, the first end-to-end retrieval-oriented MLLM designed to integrate explicit Chain-of-Thought (CoT) reasoning. By compelling the model to first generate an interpretable reasoning chain, CIR-CoT enhances its ability to capture crucial cross-modal interactions, leading to more accurate retrieval while making its decision process transparent. Si
Related papers
- Cotmr: Chain-of-thought Multi-scale Reasoning For Training-free Zero-shot Composed Image Retrieval (2025) · 0.00
- Mcot-mvs: Multi-level Vision Selection By Multi-modal Chain-of-thought Reasoning For Composed Image Retrieval (2026) · 0.00
- Mcot-re: Multi-faceted Chain-of-thought And Re-ranking For Training-free Zero-shot Composed Image Retrieval (2025) · 0.00
- Reason-before-retrieve: One-stage Reflective Chain-of-thoughts For Training-free Zero-shot Composed Image Retrieval (2024) · 10.03
- FIRE-CIR: Fine-grained Reasoning For Composed Fashion Image Retrieval (2026) · 0.00
- Context-cir: Learning From Concepts In Text For Composed Image Retrieval (2025) · 4.67
- Compositional Image Retrieval Via Instruction-aware Contrastive Learning (2024) · 0.00
- X-cot: Explainable Text-to-video Retrieval Via Llm-based Chain-of-thought Reasoning (2025) · 2.08