Masked Vision-Language Transformer in Fashion
2022 · Ge-Peng Ji, Mingcheng Zhuge, Dehong Gao, et al.
Abstract
We present a masked vision-language transformer (MVLT) for fashion-specific multi-modal representation. Technically, we replace the BERT in the pre-training model with a vision transformer architecture, making MVLT the first end-to-end framework for the fashion domain. In addition, we design masked image reconstruction (MIR) for a fine-grained understanding of fashion. MVLT is an extensible and convenient architecture that accepts raw multi-modal inputs without extra pre-processing models (e.g., ResNet) and implicitly models the vision-language alignments. More importantly, MVLT generalizes easily to various matching and generative tasks. Experimental results show clear improvements in retrieval (rank@5: 17%) and recognition (accuracy: 3%) tasks over the Fashion-Gen 2018 winner Kaleido-BERT. Code is available at https://github.com/GewelsJI/MVLT.
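The general idea behind a masked image reconstruction objective can be illustrated with a small sketch. This is not the MVLT implementation: the abstract does not give the patch size, masking ratio, fill value, or loss, so the 4x4 patches, 50% ratio, zero fill, and mean squared error below are assumptions, and the helper names (`patchify`, `mask_patches`, `reconstruction_loss`) are hypothetical. The sketch splits an image into patches, hides a random subset, and scores a prediction only on the hidden patches.

```python
import random


def patchify(image, patch):
    """Split an H x W grid (list of rows) into non-overlapping patch x patch
    tiles, each flattened to a list, in row-major tile order."""
    h, w = len(image), len(image[0])
    if h % patch or w % patch:
        raise ValueError("image size must be divisible by patch size")
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            patches.append([image[r][c]
                            for r in range(top, top + patch)
                            for c in range(left, left + patch)])
    return patches


def mask_patches(patches, ratio, rng, fill=0.0):
    """Replace a random subset of patches with a constant (assumed fill value).
    Returns the masked copy and the sorted indices of the hidden patches."""
    n_mask = int(round(ratio * len(patches)))
    masked_idx = sorted(rng.sample(range(len(patches)), n_mask))
    hidden = set(masked_idx)
    masked = [[fill] * len(p) if i in hidden else list(p)
              for i, p in enumerate(patches)]
    return masked, masked_idx


def reconstruction_loss(pred, target, masked_idx):
    """Mean squared error computed over the masked patches only (assumed loss)."""
    if not masked_idx:
        return 0.0
    total, count = 0.0, 0
    for i in masked_idx:
        for p, t in zip(pred[i], target[i]):
            total += (p - t) ** 2
            count += 1
    return total / count


# Toy 8x8 single-channel "image"; a real model would predict the hidden patches.
rng = random.Random(0)
image = [[float(r * 8 + c) for c in range(8)] for r in range(8)]
patches = patchify(image, patch=4)                  # 4 patches of 16 pixels each
masked, idx = mask_patches(patches, ratio=0.5, rng=rng)
loss_perfect = reconstruction_loss(patches, patches, idx)  # perfect prediction
loss_blank = reconstruction_loss(masked, patches, idx)     # predicting the fill value
```

Restricting the loss to hidden patches means the score reflects only what has to be inferred from context, which in a vision-language setting includes the paired text.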
Related papers
- Kaleido-BERT: Vision-Language Pre-training on Fashion Domain (2021)
- FashionViL: Fashion-Focused Vision-and-Language Representation Learning (2022)
- FaD-VLP: Fashion Vision-and-Language Pre-training Towards Unified Retrieval and Captioning (2022)
- FAME-ViL: Multi-Tasking Vision-Language Model for Heterogeneous Fashion Tasks (2023)
- UniFashion: A Unified Vision-Language Model for Multimodal Fashion Retrieval and Generation (2024)
- FashionFAE: Fine-Grained Attributes Enhanced Fashion Vision-Language Pre-training (2024)
- VLMAE: Vision-Language Masked Autoencoder (2022)
- EVE: Efficient Vision-Language Pre-training with Masked Prediction and Modality-Aware MoE (2023)