Missing Target-relevant Information Prediction With World Model For Accurate Zero-shot Composed Image Retrieval
2025 · Yuanmin Tang, Jing Yu, Keke Gai, et al.
Abstract
Zero-Shot Composed Image Retrieval (ZS-CIR) involves diverse tasks with a broad range of visual content manipulation intent across domain, scene, object, and attribute. The key challenge for ZS-CIR tasks is to modify a reference image according to manipulation text to accurately retrieve a target image, especially when the reference image is missing essential target content. In this paper, we propose a novel prediction-based mapping network, named PrediCIR, to adaptively predict the missing target visual content in reference images in the latent space before mapping for accurate ZS-CIR. Specifically, a world view generation module first constructs a source view by omitting certain visual content of a target view, coupled with an action that includes the manipulation intent derived from existing image-caption pairs. Then, a target content prediction module trains a world model as a predictor to adaptively predict the missing visual information guided by user intention in manipulating te
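The source-view construction described in the abstract can be illustrated with a small sketch. Everything here is an assumption made for illustration and is not taken from PrediCIR itself: the function names `make_source_view` and `prediction_loss`, the random patch-masking strategy, the zero-filling of hidden patches, and the mean-squared-error objective. The abstract only says that a source view is built by omitting some visual content of a target view and that a predictor recovers the missing content in latent space.

```python
import numpy as np

def make_source_view(target_patches, mask_ratio=0.5, seed=0):
    """Build a hypothetical source view by hiding a random subset of patches.

    target_patches: array of shape (num_patches, dim) holding patch embeddings.
    Returns (source_view, hidden_idx); hidden rows of source_view are zeroed.
    """
    rng = np.random.default_rng(seed)
    num_patches = target_patches.shape[0]
    num_hidden = int(round(num_patches * mask_ratio))
    # Choose which patches to omit from the target view (assumed: uniform random).
    hidden_idx = np.sort(rng.choice(num_patches, size=num_hidden, replace=False))
    source_view = target_patches.copy()
    source_view[hidden_idx] = 0.0
    return source_view, hidden_idx

def prediction_loss(predicted_patches, target_patches, hidden_idx):
    """Mean squared error on the hidden patches only (an assumed objective)."""
    diff = predicted_patches[hidden_idx] - target_patches[hidden_idx]
    return float(np.mean(diff ** 2))

# Toy usage: 16 patches with 4-dimensional embeddings.
target = np.random.default_rng(1).normal(size=(16, 4))
source, hidden = make_source_view(target, mask_ratio=0.5)
perfect = prediction_loss(target, target, hidden)   # a perfect predictor scores 0.0
baseline = prediction_loss(source, target, hidden)  # predicting zeros for hidden patches
```

In this toy setup, a trained predictor would take `source` plus some encoding of the manipulation intent (the "action") and aim for a loss below `baseline`; how that conditioning is done is not specified in the text above.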
Related papers
- Context-i2w: Mapping Images To Context-dependent Words For Accurate Zero-shot Composed Image Retrieval (2023) 15.41
- From Mapping To Composing: A Two-stage Framework For Zero-shot Composed Image Retrieval (2025) 0.00
- Fine-grained Zero-shot Composed Image Retrieval With Complementary Visual-semantic Integration (2026) 1.24
- Multimodal Reasoning Agent For Zero-shot Composed Image Retrieval (2025) 0.00
- Data-efficient Generalization For Zero-shot Composed Image Retrieval (2025) 2.26
- Modality And Task Adaptation For Enhanced Zero-shot Composed Image Retrieval (2024) 0.00
- Pretrain Like Your Inference: Masked Tuning Improves Zero-shot Composed Image Retrieval (2023) 2.86
- Knowledge-enhanced Dual-stream Zero-shot Composed Image Retrieval (2024) 11.08