Turning Adversaries Into Allies: Reversing Typographic Attacks For Multimodal E-commerce Product Retrieval
2025 · Janet Jenq, Hongda Shen
Abstract
Multimodal product retrieval systems in e-commerce platforms rely on effectively combining visual and textual signals to improve search relevance and user experience. However, vision-language models such as CLIP are vulnerable to typographic attacks, where misleading or irrelevant text embedded in images skews model predictions. In this work, we propose a novel method that reverses the logic of typographic attacks by rendering relevant textual content (e.g., titles, descriptions) directly onto product images to perform vision-text compression, thereby strengthening image-text alignment and boosting multimodal product retrieval performance. We evaluate our method on three vertical-specific e-commerce datasets (sneakers, handbags, and trading cards) using six state-of-the-art vision foundation models. Our experiments demonstrate consistent improvements in unimodal and multimodal retrieval accuracy across categories and model families. Our findings suggest that visually rendering product
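The core idea in the abstract, drawing a product's title or description onto its image before the image is encoded, can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes the Pillow library, and the function name `render_text_on_image`, the white band appended below the photo, the font, the wrapping width, and the sample title are all hypothetical choices. The abstract does not say where or how the text is placed.

```python
import textwrap

from PIL import Image, ImageDraw, ImageFont


def render_text_on_image(image, text, chars_per_line=40, line_height=12, margin=4):
    """Return a copy of `image` with `text` drawn in a white band below it.

    Assumption: the text goes in an appended band so it never covers the
    product; the paper may place or style the text differently.
    """
    lines = textwrap.wrap(text, width=chars_per_line) or [""]
    band_height = 2 * margin + line_height * len(lines)

    # New canvas: original image on top, blank white band underneath.
    canvas = Image.new("RGB", (image.width, image.height + band_height), "white")
    canvas.paste(image.convert("RGB"), (0, 0))

    draw = ImageDraw.Draw(canvas)
    font = ImageFont.load_default()
    for i, line in enumerate(lines):
        y = image.height + margin + i * line_height
        draw.text((margin, y), line, fill="black", font=font)
    return canvas


# Stand-in for a real product photo and a made-up product title.
product = Image.new("RGB", (320, 240), "gray")
augmented = render_text_on_image(
    product, "Hypothetical Brand Runner 2 sneaker, white and red, size 10"
)
```

The augmented image would then be passed to the vision encoder in place of the original photo, so that the rendered text contributes to the image embedding.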
Related papers
- Extending CLIP For Category-to-image Retrieval In E-commerce (2021)
- Bringing Multimodality To Amazon Visual Search System (2024)
- VL-CLIP: Enhancing Multimodal Recommendations Via Visual Grounding And LLM-augmented CLIP Embeddings (2025)
- Captions Are Worth A Thousand Words: Enhancing Product Retrieval With Pretrained Image-to-text Models (2024)
- ACE-BERT: Adversarial Cross-modal Enhanced BERT For E-commerce Retrieval (2021)
- Optimizing Product Deduplication In E-commerce With Multimodal Embeddings (2025)
- Unified Vision-language Representation Modeling For E-commerce Same-style Products Retrieval (2023)
- Semantic-enhanced Modality-asymmetric Retrieval For Online E-commerce Search (2025)