FaD-VLP: Fashion Vision-and-Language Pre-training towards Unified Retrieval and Captioning
2022 · Suvir Mirchandani, Licheng Yu, Mengjiao Wang, et al.
Abstract
Multimodal tasks in the fashion domain have significant potential for e-commerce, but involve challenging vision-and-language learning problems - e.g., retrieving a fashion item given a reference image plus text feedback from a user. Prior works on multimodal fashion tasks have either been limited by the data in individual benchmarks, or have leveraged generic vision-and-language pre-training but have not taken advantage of the characteristics of fashion data. Additionally, these works have mainly been restricted to multimodal understanding tasks. To address these gaps, we make two key contributions. First, we propose a novel fashion-specific pre-training framework based on weakly-supervised triplets constructed from fashion image-text pairs. We show the triplet-based tasks are an effective addition to standard multimodal pre-training tasks. Second, we propose a flexible decoder-based model architecture capable of both fashion retrieval and captioning tasks. Together, our model design
Related papers
- FashionViL: Fashion-Focused Vision-and-Language Representation Learning (2022)
- FashionFAE: Fine-grained Attributes Enhanced Fashion Vision-Language Pre-training (2024)
- FAME-ViL: Multi-Tasking Vision-Language Model for Heterogeneous Fashion Tasks (2023)
- UniFashion: A Unified Vision-Language Model for Multimodal Fashion Retrieval and Generation (2024)
- Training and Challenging Models for Text-Guided Fashion Image Retrieval (2022)
- Kaleido-BERT: Vision-Language Pre-training on Fashion Domain (2021)
- Masked Vision-Language Transformer in Fashion (2022)
- A Hybrid Multimodal Deep Learning Framework for Intelligent Fashion Recommendation (2025)