SETR: A Two-stage Semantic-enhanced Framework For Zero-shot Composed Image Retrieval
2025 · Yuqi Xiao, Yingying Zhu
Abstract
Zero-shot Composed Image Retrieval (ZS-CIR) aims to retrieve a target image given a reference image and a relative text, without relying on costly triplet annotations. Existing CLIP-based methods face two core challenges: (1) union-based feature fusion indiscriminately aggregates all visual cues, carrying over irrelevant background details that dilute the intended modification, and (2) global cosine similarity from CLIP embeddings lacks the ability to resolve fine-grained semantic relations. To address these issues, we propose SETR (Semantic-enhanced Two-Stage Retrieval). In the coarse retrieval stage, SETR introduces an intersection-driven strategy that retains only the overlapping semantics between the reference image and relative text, thereby filtering out distractors inherent to union-based fusion and producing a cleaner, high-precision candidate set. In the fine-grained re-ranking stage, we adapt a pretrained multimodal LLM with Low-Rank Adaptation to conduct binary semantic rele
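The coarse-then-fine flow described in the abstract can be illustrated with a minimal retrieve-then-rerank sketch. This is not the authors' implementation: the function names are hypothetical, the query embedding is assumed to have been produced already (the intersection-driven step is not reproduced here), and `relevance_fn` is a stand-in for whatever score the LoRA-adapted multimodal LLM would assign to each candidate.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)

def coarse_retrieve(query_emb, gallery_embs, k):
    """Stage 1 (assumed form): top-k gallery indices by cosine similarity.

    query_emb    : (d,) array, the composed query embedding (assumed given)
    gallery_embs : (n, d) array of candidate image embeddings
    """
    q = l2_normalize(query_emb)
    g = l2_normalize(gallery_embs)
    sims = g @ q                        # cosine similarity per gallery item
    k = min(k, len(sims))
    top = np.argsort(-sims, kind="stable")[:k]
    return top, sims[top]

def rerank(candidates, relevance_fn):
    """Stage 2 (assumed form): reorder candidates by a relevance score.

    relevance_fn is a hypothetical callable mapping a gallery index to a
    score; here it stands in for a learned binary relevance judgement.
    """
    scores = np.array([relevance_fn(int(i)) for i in candidates])
    order = np.argsort(-scores, kind="stable")
    return candidates[order], scores[order]
```

A typical call would run `coarse_retrieve` with a small `k` to obtain a candidate set, then pass those indices to `rerank`, so the more expensive scorer only sees the shortlisted items.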
Related papers
- Fine-grained Zero-shot Composed Image Retrieval With Complementary Visual-semantic Integration (2026) 1.24
- From Mapping To Composing: A Two-stage Framework For Zero-shot Composed Image Retrieval (2025) 0.00
- Zero-shot Composed Image Retrieval With Textual Inversion (2023) 19.84
- iSEARLE: Improving Textual Inversion For Zero-shot Composed Image Retrieval (2024) 12.09
- WISER: Wider Search, Deeper Thinking, And Adaptive Fusion For Training-free Zero-shot Composed Image Retrieval (2026) 2.98
- Data-efficient Generalization For Zero-shot Composed Image Retrieval (2025) 2.26
- Knowledge-enhanced Dual-stream Zero-shot Composed Image Retrieval (2024) 11.08
- Multimodal Reasoning Agent For Zero-shot Composed Image Retrieval (2025) 0.00