VL-CLIP: Enhancing Multimodal Recommendations Via Visual Grounding And LLM-augmented CLIP Embeddings
2025 · Ramin Giahi, Kehui Yao, Sriram Kollipara, et al.
Abstract
Multimodal learning plays a critical role in e-commerce recommendation platforms today, enabling accurate recommendations and product understanding. However, existing vision-language models, such as CLIP, face key challenges in e-commerce recommendation systems: 1) Weak object-level alignment, where global image embeddings fail to capture fine-grained product attributes, leading to suboptimal retrieval performance; 2) Ambiguous textual representations, where product descriptions often lack contextual clarity, affecting cross-modal matching; and 3) Domain mismatch, as generic vision-language models may not generalize well to e-commerce-specific data. To address these limitations, we propose a framework, VL-CLIP, that enhances CLIP embeddings by integrating Visual Grounding for fine-grained visual understanding and an LLM-based agent for generating enriched text embeddings. Visual Grounding refines image representations by localizing key products, while the LLM agent enhances textual fea