FashionFAE: Fine-grained Attributes Enhanced Fashion Vision-Language Pre-training
2024 · Jiale Huang, Dehong Gao, Jinxia Zhang, et al.
Abstract
Large-scale Vision-Language Pre-training (VLP) has demonstrated remarkable success in the general domain. However, in the fashion domain, items are distinguished by fine-grained attributes like texture and material, which are crucial for tasks such as retrieval. Existing models often fail to leverage these fine-grained attributes from both text and image modalities. To address the above issues, we propose a novel approach for the fashion domain, Fine-grained Attributes Enhanced VLP (FashionFAE), which focuses on the detailed characteristics of fashion data. An attribute-emphasized text prediction task is proposed to predict fine-grained attributes of the items. This forces the model to focus on the salient attributes from the text modality. Additionally, a novel attribute-promoted image reconstruction task is proposed, which further enhances the fine-grained ability of the model by leveraging the representative attributes from the image modality. Extensive experiments show that Fashion
Related papers
- FashionViL: Fashion-focused Vision-and-language Representation Learning (2022) · 14.66
- FaD-VLP: Fashion Vision-and-language Pre-training Towards Unified Retrieval And Captioning (2022) · 7.81
- FAME-ViL: Multi-tasking Vision-language Model For Heterogeneous Fashion Tasks (2023) · 15.69
- FACap: A Large-scale Fashion Dataset For Fine-grained Composed Image Retrieval (2025) · 0.00
- Training And Challenging Models For Text-guided Fashion Image Retrieval (2022) · 0.00
- Kaleido-BERT: Vision-language Pre-training On Fashion Domain (2021) · 14.69
- Masked Vision-language Transformer In Fashion (2022) · 12.41
- Partial Visual-semantic Embedding: Fashion Intelligence System With Sensitive Part-by-part Learning (2022) · 0.00