FBCIR: Balancing Cross-modal Focuses In Composed Image Retrieval
2026 · Chenchen Zhao, Jianhuan Zhuo, Muxi Chen, et al.
Abstract
Composed image retrieval (CIR) requires multi-modal models to jointly reason over visual content and semantic modifications presented in text-image input pairs. While current CIR models achieve strong performance on common benchmark cases, their accuracy often degrades in more challenging scenarios where negative candidates are semantically aligned with the query image or text. In this paper, we attribute this degradation to focus imbalances, where models disproportionately attend to one modality while neglecting the other. To validate this claim, we propose FBCIR, a multi-modal focus interpretation method that identifies the visual and textual input components most crucial to a model's retrieval decisions. Using FBCIR, we report that focus imbalances are prevalent in existing CIR models, especially under hard negative settings. Building on these analyses, we further propose a CIR data augmentation workflow that enriches existing CIR datasets with curated hard negatives designed to
Related papers
- OFFSET: Segmentation-based Focus Shift Revision For Composed Image Retrieval (2025) 5.84
- NCL-CIR: Noise-aware Contrastive Learning For Composed Image Retrieval (2025) 2.26
- DAFM: Dynamic Adaptive Fusion For Multi-model Collaboration In Composed Image Retrieval (2025) 0.00
- Improving Composed Image Retrieval Via Contrastive Learning With Scaling Positives And Negatives (2024) 11.30
- A Sanity Check On Composed Image Retrieval (2026) 0.00
- TMCIR: Token Merge Benefits Composed Image Retrieval (2025) 0.00
- HINT: Composed Image Retrieval With Dual-path Compositional Contextualized Network (2026) 0.78
- CSMCIR: Cot-enhanced Symmetric Alignment With Memory Bank For Composed Image Retrieval (2026) 0.00