FashionMV: Product-level Composed Image Retrieval With Multi-view Fashion Data
2026 · Peng Yuan, Bingyin Mei, Hui Zhang
Abstract
Composed Image Retrieval (CIR) retrieves target images using a reference image paired with modification text. Despite rapid advances, all existing methods and datasets operate at the image level -- a single reference image plus modification text in, a single target image out -- while real e-commerce users reason about products shown from multiple viewpoints. We term this mismatch View Incompleteness and formally define a new Multi-View CIR task that generalizes standard CIR from image-level to product-level retrieval. To support this task, we construct FashionMV, the first large-scale multi-view fashion dataset for product-level CIR, comprising 127K products, 472K multi-view images, and over 220K CIR triplets, built through a fully automated pipeline leveraging large multimodal models. We further propose ProCIR (Product-level Composed Image Retrieval), a modeling framework built upon a multimodal large language model that employs three complementary mechanisms -- two-stage dialogue, ca