Understanding The Effect Of Using Semantically Meaningful Tokens For Visual Representation Learning
2024 · Neha Kalibhat, Priyatham Kattakinda, Sumit Nawathe, et al.
Abstract
Vision transformers have established a precedent of patchifying images into uniformly-sized chunks before processing. We hypothesize that this design choice may limit models in learning comprehensive and compositional representations from visual data. This paper explores the notion of providing semantically-meaningful visual tokens to transformer encoders within a vision-language pre-training framework. Leveraging off-the-shelf segmentation and scene-graph models, we extract representations of instance segmentation masks (referred to as tangible tokens) and relationships and actions (referred to as intangible tokens). Subsequently, we pre-train a vision-side transformer by incorporating these newly extracted tokens and aligning the resultant embeddings with caption embeddings from a text-side encoder. To capture the structural and semantic relationships among visual tokens, we introduce additive attention weights, which are used to compute self-attention scores. Our experiments on COCO
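The abstract does not spell out how the additive attention weights enter the computation. As a minimal sketch, assuming they act as a pairwise bias added to the scaled dot-product logits before the softmax, the idea could look like the following. The function name, the projection matrices, and the example bias values are all hypothetical and are not taken from the paper.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention_with_additive_bias(tokens, w_q, w_k, w_v, bias):
    """Single-head self-attention with a pairwise additive bias.

    tokens: (n, dim) token embeddings (e.g. tangible and intangible tokens).
    bias:   (n, n) additive weights; entry [i, j] is added to the logit
            for token i attending to token j (assumed formulation).
    """
    q = tokens @ w_q
    k = tokens @ w_k
    v = tokens @ w_v
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d) + bias  # bias is added before the softmax
    weights = softmax(scores, axis=-1)
    return weights @ v, weights


rng = np.random.default_rng(0)
n, dim = 4, 8
tokens = rng.normal(size=(n, dim))
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) for _ in range(3))

# Hypothetical bias: tokens 0 and 1 are marked as related (for instance,
# two objects linked by a scene-graph edge), all other pairs get no bias.
bias = np.zeros((n, n))
bias[0, 1] = bias[1, 0] = 5.0

out, weights = attention_with_additive_bias(tokens, w_q, w_k, w_v, bias)
_, weights_plain = attention_with_additive_bias(
    tokens, w_q, w_k, w_v, np.zeros((n, n))
)
```

In this sketch a positive bias on a pair of tokens raises the attention they pay to each other relative to unbiased attention, which is one plausible way to inject structural or semantic relationships into the scores.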
Related papers
- Analyzing Local Representations Of Self-supervised Vision Transformers (2023) 0.00
- Boosting Vision Transformers For Image Retrieval (2022) 15.28
- Decoupling The Role Of Data, Attention, And Losses In Multimodal Transformers (2021) 13.88
- Discriminative-generative Custom Tokens For Vision-language Models (2025) 0.00
- Billion-scale Pretraining With Vision Transformers For Multi-task Visual Representations (2021) 9.23
- Thinking Fast And Slow: Efficient Text-to-visual Retrieval With Transformers (2021) 15.16
- One Trajectory, One Token: Grounded Video Tokenization Via Panoptic Sub-object Trajectory (2025) 0.00
- Vista: Vision And Scene Text Aggregation For Cross-modal Retrieval (2022) 14.31