Unicvr: From Alignment To Reranking For Unified Zero-shot Composed Visual Retrieval
2026 Β· Haokun Wen, Xuemeng Song, Haoyu Zhang, et al.
Abstract
Composed image retrieval, multi-turn composed image retrieval, and composed video retrieval all share a common paradigm: composing the reference visual with modification text to retrieve the desired target. Despite this shared structure, the three tasks have been studied in isolation, with no prior work proposing a unified framework, let alone a zero-shot solution. In this paper, we propose UniCVR, the first unified zero-shot composed visual retrieval framework that jointly addresses all three tasks without any task-specific human-annotated data. UniCVR strategically combines two complementary strengths: Multimodal Large Language Models (MLLMs) for compositional query understanding and Vision-Language Pre-trained (VLP) models for structured visual retrieval. Concretely, UniCVR operates in two stages. In Stage I, we train the MLLM as a compositional query embedder via contrastive learning on a curated multi-source dataset of approximately 3.5M samples, bridging the heterogeneous embeddi
Authors
(none)
Tags
Stats
Related papers
- Fine-grained Zero-shot Composed Image Retrieval With Complementary Visual-semantic Integration (2026)1.24
- Mcot-re: Multi-faceted Chain-of-thought And Re-ranking For Training-free Zero-shot Composed Image Retrieval (2025)0.00
- From Mapping To Composing: A Two-stage Framework For Zero-shot Composed Image Retrieval (2025)0.00
- X-aligner: Composed Visual Retrieval Without The Bells And Whistles (2026)0.00
- Multimodal Reasoning Agent For Zero-shot Composed Image Retrieval (2025)0.00
- Cotmr: Chain-of-thought Multi-scale Reasoning For Training-free Zero-shot Composed Image Retrieval (2025)0.00
- Universal Vision-language Dense Retrieval: Learning A Unified Representation Space For Multi-modal Retrieval (2022)3.45
- SETR: A Two-stage Semantic-enhanced Framework For Zero-shot Composed Image Retrieval (2025)0.00