An Analysis of Vision-Language Models for Fabric Retrieval
2025 · Francesco Giuliari, Asif Khan Pattan, Mohamed Lamine Mekhalfi, et al.
Abstract
Effective cross-modal retrieval is essential for applications such as information retrieval and recommendation systems, particularly in specialized domains such as manufacturing, where product information often consists of visual samples paired with a textual description. This paper investigates the use of Vision-Language Models (VLMs) for zero-shot text-to-image retrieval on fabric samples. We address the lack of publicly available datasets by introducing an automated annotation pipeline that uses Multimodal Large Language Models (MLLMs) to generate two types of textual descriptions: freeform natural language and structured attribute-based descriptions. We use these descriptions to evaluate retrieval performance across three Vision-Language Models: CLIP, LAION-CLIP, and Meta's Perception Encoder. Our experiments demonstrate that structured, attribute-rich descriptions significantly enhance retrieval accuracy, particularly for visually complex fabric classes, with the Perception Encode
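To make the retrieval setting concrete, the following is a minimal sketch of zero-shot text-to-image retrieval over precomputed embeddings. It is not the authors' pipeline: the toy vectors stand in for the outputs of a VLM's text and image encoders, the helper names (`cosine`, `rank_images`, `recall_at_k`) are hypothetical, and Recall@k is an assumed evaluation metric that the abstract does not name.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_images(text_emb, image_embs):
    """Return image indices sorted by descending similarity to the text query."""
    scores = [cosine(text_emb, img) for img in image_embs]
    return sorted(range(len(image_embs)), key=lambda i: scores[i], reverse=True)

def recall_at_k(text_embs, image_embs, targets, k):
    """Fraction of text queries whose target image appears in the top-k results."""
    hits = sum(
        1
        for text_emb, target in zip(text_embs, targets)
        if target in rank_images(text_emb, image_embs)[:k]
    )
    return hits / len(text_embs)

# Toy data (assumption): in practice these would be encoder outputs for
# fabric images and their generated descriptions.
image_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
text_embs = [[0.9, 0.1, 0.0], [0.1, 0.2, 0.9], [0.6, 0.7, 0.0]]
targets = [0, 2, 0]  # index of the correct image for each description

recall_1 = recall_at_k(text_embs, image_embs, targets, k=1)
recall_2 = recall_at_k(text_embs, image_embs, targets, k=2)
```

In this toy setup the third description is closer to the wrong image than to its target, so it counts as a miss at k=1 and a hit at k=2. Comparing freeform and structured descriptions under such a scheme would amount to swapping `text_embs` while keeping `image_embs` and `targets` fixed.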