ReCALL: Recalibrating Capability Degradation for MLLM-based Composed Image Retrieval
2026 · Tianyu Yang, Chenwei He, Xiangzhao Hao, et al.
Abstract
Composed Image Retrieval (CIR) aims to retrieve target images based on a hybrid query comprising a reference image and a modification text. Early dual-tower Vision-Language Models (VLMs) struggle with cross-modality compositional reasoning required for this task. While adapting generative Multimodal Large Language Models (MLLMs) for retrieval offers a promising direction, we identify that this strategy overlooks a fundamental issue: compressing a generative MLLM into a single-embedding discriminative retriever triggers a paradigm conflict, which leads to Capability Degradation - the deterioration of native fine-grained reasoning after retrieval adaptation. To address this challenge, we propose ReCALL, a model-agnostic framework that follows a diagnose-generate-refine pipeline: First, we diagnose cognitive blind spots of the retriever via self-guided informative instance mining. Next, we generate corrective instructions and triplets by prompting the foundation MLLM and conduct quality c
Related papers
- MCoT-RE: Multi-faceted Chain-of-thought and Re-ranking for Training-free Zero-shot Composed Image Retrieval (2025) 0.00
- MCoT-MVS: Multi-level Vision Selection by Multi-modal Chain-of-thought Reasoning for Composed Image Retrieval (2026) 0.00
- CREM: Compression-driven Representation Enhancement for Multimodal Retrieval and Comprehension (2026) 0.00
- CIR-CoT: Towards Interpretable Composed Image Retrieval via End-to-end Chain-of-thought Reasoning (2025) 0.00
- Leveraging Large Vision-language Model as User Intent-aware Encoder for Composed Image Retrieval (2024) 3.58
- Compositional Image Retrieval via Instruction-aware Contrastive Learning (2024) 0.00
- Generative Cross-modal Retrieval: Memorizing Images in Multimodal Language Models for Retrieval and Beyond (2024) 8.35
- Visual Delta Generator with Large Multi-modal Models for Semi-supervised Composed Image Retrieval (2024) 9.03