TMCIR: Token Merge Benefits Composed Image Retrieval
2025 · Chaoyang Wang, Zeyu Zhang, Long Teng, et al.
Abstract
Composed Image Retrieval (CIR) retrieves target images using a multi-modal query that combines a reference image with text describing desired modifications. The primary challenge is effectively fusing this visual and textual information. Current cross-modal feature fusion approaches for CIR exhibit an inherent bias in intention interpretation. These methods tend to disproportionately emphasize either the reference image features (visual-dominant fusion) or the textual modification intent (text-dominant fusion through image-to-text conversion). Such an imbalanced representation often fails to accurately capture and reflect the actual search intent of the user in the retrieval results. To address this challenge, we propose TMCIR, a novel framework that advances composed image retrieval through two key innovations: 1) Intent-Aware Cross-Modal Alignment. We first fine-tune CLIP encoders contrastively using intent-reflecting pseudo-target images, synthesized from reference images and textua
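As a rough illustration of the contrastive alignment step described in the abstract, the sketch below computes a symmetric InfoNCE loss between composed-query embeddings and pseudo-target embeddings. The abstract shown here does not specify the loss, so the symmetric InfoNCE form, the temperature value, and the function names (`info_nce`, `symmetric_info_nce`) are assumptions made for illustration. The short vectors stand in for CLIP encoder outputs, and nothing here should be read as the authors' implementation.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (assumes v is not the zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(queries, targets, temperature=0.07):
    """Mean cross-entropy of matching query i to target i within the batch.

    queries[i] and targets[i] are treated as a positive pair; every other
    target in the batch acts as a negative for query i.
    """
    q = [l2_normalize(v) for v in queries]
    t = [l2_normalize(v) for v in targets]
    total = 0.0
    for i, qi in enumerate(q):
        # Cosine similarity to every target, scaled by the temperature.
        logits = [sum(a * b for a, b in zip(qi, tj)) / temperature for tj in t]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_denominator - logits[i]
    return total / len(q)


def symmetric_info_nce(queries, targets, temperature=0.07):
    """Average of the query-to-target and target-to-query losses."""
    return 0.5 * (
        info_nce(queries, targets, temperature)
        + info_nce(targets, queries, temperature)
    )


# Toy embeddings: two queries and two pseudo-targets (hypothetical values).
toy_queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = symmetric_info_nce(toy_queries, [[0.9, 0.1], [0.1, 0.9]])
swapped = symmetric_info_nce(toy_queries, [[0.1, 0.9], [0.9, 0.1]])
print(aligned < swapped)
```

The loss is small when each query embedding lies closest to its own pseudo-target and grows when the pairing is swapped, which is the behaviour a contrastive fine-tuning objective of this kind relies on.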
Related papers
- CSMCIR: Cot-enhanced Symmetric Alignment With Memory Bank For Composed Image Retrieval (2026), 0.00
- HINT: Composed Image Retrieval With Dual-path Compositional Contextualized Network (2026), 0.78
- Infocir: Multimedia Analysis For Composed Image Retrieval (2026), 1.24
- DAFM: Dynamic Adaptive Fusion For Multi-model Collaboration In Composed Image Retrieval (2025), 0.00
- From Mapping To Composing: A Two-stage Framework For Zero-shot Composed Image Retrieval (2025), 0.00
- Zero-shot Composed Text-image Retrieval (2023), 0.00
- Cala: Complementary Association Learning For Augmenting Composed Image Retrieval (2024), 9.41
- NCL-CIR: Noise-aware Contrastive Learning For Composed Image Retrieval (2025), 2.26