FAME-ViL: Multi-tasking Vision-language Model For Heterogeneous Fashion Tasks
2023 · Xiao Han, Xiatian Zhu, Licheng Yu, et al.
Abstract
In the fashion domain, there exists a variety of vision-and-language (V+L) tasks, including cross-modal retrieval, text-guided image retrieval, multi-modal classification, and image captioning. They differ drastically in each individual input/output format and dataset size. It has been common to design a task-specific model and fine-tune it independently from a pre-trained V+L model (e.g., CLIP). This results in parameter inefficiency and inability to exploit inter-task relatedness. To address such issues, we propose a novel FAshion-focused Multi-task Efficient learning method for Vision-and-Language tasks (FAME-ViL) in this work. Compared with existing approaches, FAME-ViL applies a single model for multiple heterogeneous fashion tasks, therefore being much more parameter-efficient. It is enabled by two novel components: (1) a task-versatile architecture with cross-attention adapters and task-specific adapters integrated into a unified V+L model, and (2) a stable and effective multi-t
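To make the parameter-efficiency argument concrete, the following is a minimal back-of-the-envelope sketch, not the paper's accounting. It compares fine-tuning one full model per task against one shared backbone with small per-task adapters. Every number in it (backbone size, hidden width, bottleneck width, layer count) is a hypothetical placeholder, and the adapter is assumed to be a generic down-project/up-project bottleneck, which may differ from the adapters FAME-ViL actually uses. Cross-attention adapters are not modelled here.

```python
# Hypothetical sizes, chosen only for illustration; none come from the paper.
BACKBONE_PARAMS = 150_000_000   # assumed size of one pre-trained V+L backbone
HIDDEN_DIM = 512                # assumed transformer hidden width
BOTTLENECK_DIM = 64             # assumed adapter bottleneck width
NUM_LAYERS = 12                 # assumed number of layers that receive an adapter

# The four task families named in the abstract.
TASKS = [
    "cross-modal retrieval",
    "text-guided image retrieval",
    "multi-modal classification",
    "image captioning",
]


def adapter_params(hidden: int, bottleneck: int) -> int:
    """Parameters of one generic bottleneck adapter (weights plus biases)."""
    down = hidden * bottleneck + bottleneck   # down-projection
    up = bottleneck * hidden + hidden         # up-projection
    return down + up


def independent_total(num_tasks: int) -> int:
    """One fully fine-tuned copy of the backbone per task."""
    return num_tasks * BACKBONE_PARAMS


def shared_total(num_tasks: int) -> int:
    """One shared backbone plus a per-layer adapter set for each task."""
    per_task = NUM_LAYERS * adapter_params(HIDDEN_DIM, BOTTLENECK_DIM)
    return BACKBONE_PARAMS + num_tasks * per_task


if __name__ == "__main__":
    n = len(TASKS)
    print(f"independent models: {independent_total(n):,} parameters")
    print(f"shared + adapters:  {shared_total(n):,} parameters")
```

Under these assumed sizes the shared setup grows by well under one million parameters per extra task, whereas the independent setup grows by a full backbone each time. The actual savings reported for FAME-ViL depend on its real architecture and are not reproduced here.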
Related papers
- FashionViL: Fashion-focused Vision-and-language Representation Learning (2022) · 14.66
- FaD-VLP: Fashion Vision-and-language Pre-training Towards Unified Retrieval And Captioning (2022) · 7.81
- FashionFAE: Fine-grained Attributes Enhanced Fashion Vision-language Pre-training (2024) · 0.00
- UniFashion: A Unified Vision-language Model For Multimodal Fashion Retrieval And Generation (2024) · 10.66
- Masked Vision-language Transformer In Fashion (2022) · 12.41
- 12-in-1: Multi-task Vision And Language Representation Learning (2019) · 17.85
- Training And Challenging Models For Text-guided Fashion Image Retrieval (2022) · 0.00
- FACap: A Large-scale Fashion Dataset For Fine-grained Composed Image Retrieval (2025) · 0.00