MaskInversion: Localized Embeddings via Optimization of Explainability Maps
2024 · Walid Bousselham, Sofian Chaybouti, Christian Rupprecht, et al.
Abstract
Vision-language foundation models such as CLIP have achieved tremendous results in global vision-language alignment, but still show some limitations in creating representations for specific image regions. To address this problem, we propose MaskInversion, a method that leverages the feature representations of pre-trained foundation models, such as CLIP, to generate a context-aware embedding for a query image region specified by a mask at test time. MaskInversion starts by initializing an embedding token and comparing its explainability map, derived from the foundation model, to the query mask. The embedding token is then refined to approximate the query region by minimizing the discrepancy between its explainability map and the query mask. During this process, only the embedding vector is updated, while the underlying foundation model is kept frozen, which allows MaskInversion to be used with any pre-trained model. As deriving the explainability map involves computing its gradie
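The refinement loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the "frozen model" is reduced to a fixed list of hypothetical patch features, and the explainability map is replaced by a simple sigmoid similarity between the token and each patch, purely as an assumption for illustration. The names `explain_map`, `mask_loss`, and `mask_inversion` are likewise made up here. Only the embedding token is updated, using plain gradient descent on the squared discrepancy between the map and the query mask.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def explain_map(token, patches):
    """Toy stand-in for an explainability map: one score in (0, 1) per patch."""
    return [sigmoid(sum(t * p for t, p in zip(token, patch))) for patch in patches]


def mask_loss(token, patches, mask):
    """Mean squared discrepancy between the map and the binary query mask."""
    scores = explain_map(token, patches)
    return sum((s - m) ** 2 for s, m in zip(scores, mask)) / len(mask)


def mask_inversion(patches, mask, steps=200, lr=1.0, seed=0):
    """Refine an embedding token so its map approximates the query mask.

    The patch features stay fixed (the "frozen model"); only the token changes.
    """
    rng = random.Random(seed)
    dim = len(patches[0])
    token = [rng.uniform(-0.01, 0.01) for _ in range(dim)]  # small random init
    for _ in range(steps):
        scores = explain_map(token, patches)
        grad = [0.0] * dim
        for s, target, patch in zip(scores, mask, patches):
            # Gradient of (s - target)^2 w.r.t. the token, with s = sigmoid(token . patch)
            coeff = 2.0 * (s - target) * s * (1.0 - s) / len(mask)
            for i in range(dim):
                grad[i] += coeff * patch[i]
        token = [t - lr * g for t, g in zip(token, grad)]
    return token


# Hypothetical frozen patch features: two patches inside the region, two outside.
patches = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
mask = [1, 1, 0, 0]

token = mask_inversion(patches, mask)
print("loss after refinement:", mask_loss(token, patches, mask))
print("map after refinement:", explain_map(token, patches))
```

In this toy setting the learned token ends up scoring the masked patches high and the remaining patches low. A real setup would substitute the foundation model's own explainability method for `explain_map` and rely on automatic differentiation rather than the hand-derived gradient used here.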
Related papers
- Cross The Gap: Exposing The Intra-modal Misalignment In CLIP Via Modality Inversion (2025) 3.64
- Efficient Medical Vision-language Alignment Through Adapting Masked Vision Models (2025) 5.74
- IsoCLIP: Decomposing CLIP Projectors For Efficient Intra-modal Alignment (2026) 3.06
- Finetuning CLIP To Reason About Pairwise Differences (2024) 0.00
- LightCLIP: Learning Multi-level Interaction For Lightweight Vision-language Models (2023) 0.00
- Seeing What Matters: Empowering CLIP With Patch Generation-to-selection (2025) 5.24
- CLIP-ViP: Adapting Pre-trained Image-text Model To Video-language Representation Alignment (2022) 5.42
- LiteEmbed: Adapting CLIP To Rare Classes (2026) 0.00