One Perturbation, Two Failure Modes: Probing VLM Safety Via Embedding-guided Typographic Perturbations

2026


Abstract

arXiv:2604.25102v1. Typographic prompt injection exploits vision-language models' (VLMs) ability to read text rendered in images, posing a growing threat as VLMs power autonomous agents. Prior work typically focuses on maximizing attack success rate (ASR) but does not explain *why* certain renderings bypass safety alignment. We make two contributions. First, an empirical study across four VLMs including GPT-4o and Claude, twelve font sizes, and ten transformations reveals that multimodal embedding distance strongly predicts ASR (\(r = -0.71\) to \(-0.93\), \(p < 0.01\)), providing an interpretable, model-agnostic proxy. Since embedding distance predicts ASR, reducing it should improve attack success, but the relationship is mediated by two factors: perceptual readability (whether the VLM can parse the text) and safety alignment (whether it refuses to comply). Second, we use this as a red-teaming tool: we directly maximize image-text embedding simi
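The correlation claim in the abstract can be made concrete with a small sketch. The abstract does not say which distance metric or correlation estimator the authors use, so the cosine distance and Pearson coefficient below are assumptions, and the helper names (`cosine_distance`, `pearson_r`) are hypothetical. The numbers are made-up placeholders chosen only to show a negative trend; they are not results from the paper.

```python
import math


def cosine_distance(u, v):
    """Cosine distance (1 - cosine similarity) between two embedding vectors.

    Assumption: the abstract only says "embedding distance"; cosine is one
    common choice, not necessarily the authors'.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Placeholder data (NOT from the paper): one entry per rendering condition,
# e.g. a font size or transformation. A larger image-text embedding distance
# is paired with a lower attack success rate.
embedding_distances = [0.10, 0.20, 0.30, 0.40, 0.50]
attack_success_rates = [0.90, 0.70, 0.65, 0.40, 0.20]

r = pearson_r(embedding_distances, attack_success_rates)
print(f"Pearson r between distance and ASR: {r:.2f}")
```

A strongly negative `r` on real measurements would correspond to the relationship the abstract reports: renderings whose image embedding sits closer to the text embedding succeed more often.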
