FashionViL: Fashion-focused Vision-and-Language Representation Learning
2022 · Xiao Han, Licheng Yu, Xiatian Zhu, et al.
Abstract
Large-scale Vision-and-Language (V+L) pre-training for representation learning has proven effective in boosting various downstream V+L tasks. In the fashion domain, however, existing V+L methods are inadequate because they overlook the unique characteristics of both fashion V+L data and fashion downstream tasks. In this work, we propose a novel fashion-focused V+L representation learning framework, dubbed FashionViL. It contains two novel fashion-specific pre-training tasks designed to exploit two intrinsic attributes of fashion V+L data. First, in contrast to other domains, where a V+L data point contains only a single image-text pair, a fashion data point can contain multiple images. We thus propose a Multi-View Contrastive Learning task that pulls the visual representation of one image closer to the compositional multimodal representation of another image+text. Second, fashion text (e.g., product description) often contains rich fine-grained concepts
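The Multi-View Contrastive Learning idea described in the abstract can be illustrated with a small sketch. The abstract does not give the loss formulation, so the code below assumes a symmetric InfoNCE-style objective, a common choice for contrastive learning; the function names, the temperature of 0.07, and the toy embeddings are all hypothetical, and the encoder that would fuse the second image with the text is not shown.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce(anchors, positives, temperature=0.07):
    """One-directional InfoNCE loss over a batch (assumed loss form).

    anchors[i] and positives[i] are the matched pair; every other
    positives[j] in the batch serves as a negative for anchors[i].
    """
    anchors = [l2_normalize(a) for a in anchors]
    positives = [l2_normalize(p) for p in positives]
    total = 0.0
    for i, a in enumerate(anchors):
        logits = [sum(x * y for x, y in zip(a, p)) / temperature
                  for p in positives]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(v - m) for v in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)


def multi_view_contrastive_loss(image_view_a, fused_view_b_text, temperature=0.07):
    """Symmetric loss: image -> fused and fused -> image (hypothetical helper).

    image_view_a[i]: embedding of one image of product i.
    fused_view_b_text[i]: multimodal embedding of another image of
    product i combined with its text (fusion step assumed, not shown).
    """
    forward = info_nce(image_view_a, fused_view_b_text, temperature)
    backward = info_nce(fused_view_b_text, image_view_a, temperature)
    return 0.5 * (forward + backward)


# Toy 2-D embeddings, made up for illustration only.
view_a = [[1.0, 0.0], [0.0, 1.0]]
fused = [[0.9, 0.1], [0.1, 0.9]]

aligned = multi_view_contrastive_loss(view_a, fused)
shuffled = multi_view_contrastive_loss(view_a, fused[::-1])
```

With these toy inputs, the loss for correctly paired views (`aligned`) is lower than the loss when the pairs are swapped (`shuffled`), which is the behaviour a contrastive objective is meant to reward.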
Related papers
- FashionFAE: Fine-grained Attributes Enhanced Fashion Vision-language Pre-training (2024) 0.00
- FAME-ViL: Multi-tasking Vision-language Model For Heterogeneous Fashion Tasks (2023) 15.69
- FaD-VLP: Fashion Vision-and-language Pre-training Towards Unified Retrieval And Captioning (2022) 7.81
- Masked Vision-language Transformer In Fashion (2022) 12.41
- Kaleido-BERT: Vision-language Pre-training On Fashion Domain (2021) 14.69
- FACap: A Large-scale Fashion Dataset For Fine-grained Composed Image Retrieval (2025) 0.00
- UniFashion: A Unified Vision-language Model For Multimodal Fashion Retrieval And Generation (2024) 10.66
- Partial Visual-semantic Embedding: Fashion Intelligence System With Sensitive Part-by-part Learning (2022) 0.00