UniFashion: A Unified Vision-Language Model for Multimodal Fashion Retrieval and Generation
2024 · Xiangyu Zhao, Yuehan Zhang, Wenlong Zhang, et al.
Abstract
The fashion domain encompasses a variety of real-world multimodal tasks, including multimodal retrieval and multimodal generation. Rapid advances in AI-generated content, particularly large language models for text generation and diffusion models for visual generation, have sparked widespread research interest in applying these multimodal models to the fashion domain. However, embedding-based tasks such as image-to-text and text-to-image retrieval have been largely overlooked from this perspective because of the diverse nature of the multimodal fashion domain, and current research on multi-task single models lacks a focus on image generation. In this work, we present UniFashion, a unified framework that simultaneously tackles multimodal generation and retrieval tasks within the fashion domain, integrating image generation with retrieval and text generation tasks. UniFashion unifies embedding and generative tasks b
Related papers
- FaD-VLP: Fashion Vision-and-Language Pre-training Towards Unified Retrieval and Captioning (2022) 7.81
- FAME-ViL: Multi-Tasking Vision-Language Model for Heterogeneous Fashion Tasks (2023) 15.69
- Fashion-RAG: Multimodal Fashion Image Editing via Retrieval-Augmented Generation (2025) 4.52
- MMFL-Net: Multi-Scale and Multi-Granularity Feature Learning for Cross-Domain Fashion Retrieval (2022) 5.84
- FashionViL: Fashion-Focused Vision-and-Language Representation Learning (2022) 14.66
- TIGER: Unifying Text-to-Image Generation and Retrieval with Large Multimodal Models (2024) 0.00
- A Hybrid Multimodal Deep Learning Framework for Intelligent Fashion Recommendation (2025) 0.00
- Masked Vision-Language Transformer in Fashion (2022) 12.41