Decomposing And Interpreting Image Representations Via Text In ViTs Beyond CLIP
2024 · Sriram Balasubramanian, Samyadeep Basu, Soheil Feizi
Abstract
Recent work has explored how individual components of the CLIP-ViT model contribute to the final representation by leveraging the shared image-text representation space of CLIP. These components, such as attention heads and MLPs, have been shown to capture distinct image features like shape, color or texture. However, understanding the role of these components in arbitrary vision transformers (ViTs) is challenging. To this end, we introduce a general framework which can identify the roles of various components in ViTs beyond CLIP. Specifically, we (a) automate the decomposition of the final representation into contributions from different model components, and (b) linearly map these contributions to CLIP space to interpret them via text. Additionally, we introduce a novel scoring function to rank components by their importance with respect to specific features. Applying our framework to various ViT variants (e.g. DeiT, DINO, DINOv2, Swin, MaxViT), we gain insights into the roles of dif
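Steps (a) and (b) of the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: all arrays are synthetic, the names (`contributions`, `linear_map`, `text_embedding`) are hypothetical, the least-squares fit is only one plausible way to obtain a linear map into CLIP space, and the variance-based score is a placeholder, since the abstract does not specify the paper's scoring function.

```python
import numpy as np

rng = np.random.default_rng(0)
n_images, n_components, d_model, d_clip = 200, 4, 16, 8

# Hypothetical per-component contributions: contributions[i, k] is what
# component k (e.g. an attention head or MLP) adds to image i's representation.
contributions = rng.normal(size=(n_images, n_components, d_model))

# (a) Decomposition: the final representation is the sum over components.
final_repr = contributions.sum(axis=1)

# Synthetic stand-in for CLIP image embeddings of the same images (assumption).
true_map = rng.normal(size=(d_model, d_clip))
clip_repr = final_repr @ true_map + 0.01 * rng.normal(size=(n_images, d_clip))

# (b) Fit a linear map from the model's space to CLIP space by least squares.
linear_map, *_ = np.linalg.lstsq(final_repr, clip_repr, rcond=None)

# Linearity means the map distributes over the sum of contributions, so each
# component's contribution can be read in CLIP space on its own.
mapped_contributions = contributions @ linear_map  # (n_images, n_components, d_clip)
mapped_final = final_repr @ linear_map             # (n_images, d_clip)

# Placeholder score (NOT the paper's): variance across images of each
# component's projection onto a synthetic unit-norm "text" direction.
text_embedding = rng.normal(size=d_clip)
text_embedding /= np.linalg.norm(text_embedding)
component_scores = (mapped_contributions @ text_embedding).var(axis=0)
ranking = np.argsort(-component_scores)  # component indices, highest score first
```

The point the sketch relies on is that a linear map preserves the decomposition: the mapped contributions still sum to the mapped final representation, which is what would allow each component to be compared against text embeddings separately.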
Related papers
- VITR: Augmenting Vision Transformers With Relation-focused Learning For Cross-modal Information Retrieval (2023)
- Finetuning CLIP To Reason About Pairwise Differences (2024)
- Analyzing Local Representations Of Self-supervised Vision Transformers (2023)
- Distill CLIP (DCLIP): Enhancing Image-text Retrieval Via Cross-modal Transformer Distillation (2025)
- CLIP-ViP: Adapting Pre-trained Image-text Model To Video-language Representation Alignment (2022)
- Disentangling Visual And Written Concepts In CLIP (2022)
- Contrasting Intra-modal And Ranking Cross-modal Hard Negatives To Enhance Visio-linguistic Compositional Understanding (2023)
- CLIP2TV: Align, Match And Distill For Video-text Retrieval (2021)