Discriminative-generative Custom Tokens For Vision-language Models
2025 · Pramuditha Perera, Matthew Trager, Luca Zancato, et al.
Abstract
This paper explores the possibility of learning custom tokens for representing new concepts in Vision-Language Models (VLMs). Our aim is to learn tokens that can be effective for both discriminative and generative tasks while composing well with words to form new input queries. The targeted concept is specified in terms of a small set of images and a parent concept described using text. We operate on CLIP text features and propose to use a combination of a textual inversion loss and a classification loss to ensure that text features of the learned token are aligned with image features of the concept in the CLIP embedding space. We restrict the learned token to a low-dimensional subspace spanned by tokens for attributes that are appropriate for the given super-class. These modifications improve the quality of compositions of the learned token with natural language for generating new scenes. Further, we show that learned custom tokens can be used to form queries for text-to-image retriev
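As a rough illustration of two ideas named in the abstract, the sketch below shows (a) a custom token constrained to the span of a few attribute token embeddings and (b) a classification loss that aligns text features with image features through cosine similarity. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the temperature value, and the use of plain NumPy arrays in place of CLIP encoders are all hypothetical, and the textual inversion loss is omitted because it depends on a generative model.

```python
import numpy as np


def normalize(x, axis=-1):
    """Scale vectors to unit L2 norm along the given axis."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)


def subspace_token(coeffs, attribute_tokens):
    """Build a token as a linear combination of attribute token embeddings.

    coeffs: shape (k,), the only learnable parameters in this sketch.
    attribute_tokens: shape (k, d), embeddings of k attribute words
        (assumed to be given; how they are chosen is not shown here).
    Returns a vector of shape (d,) that lies in the span of the attributes.
    """
    return coeffs @ attribute_tokens


def classification_loss(text_features, image_features, labels, temperature=0.07):
    """Cross-entropy over cosine-similarity logits (images vs. class texts).

    text_features: shape (num_classes, d), one text feature per class.
    image_features: shape (n, d), one feature per image.
    labels: shape (n,), integer class index for each image.
    temperature: assumed value, not taken from the paper.
    """
    logits = normalize(image_features) @ normalize(text_features).T / temperature
    # Subtract the row maximum for numerical stability before the softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()
```

In a real setup the text features would come from passing a prompt that contains the learned token through the CLIP text encoder, and the coefficients would be optimized by gradient descent; both steps are left out of this sketch.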
Related papers
- Calibclip: Contextual Calibration Of Dominant Semantics For Text-driven Image Retrieval (2025) 0.00
- Contrasting Intra-modal And Ranking Cross-modal Hard Negatives To Enhance Visio-linguistic Compositional Understanding (2023) 12.11
- Meta-personalizing Vision-language Models To Find Named Instances In Video (2023) 8.60
- Infusing Fine-grained Visual Knowledge To Vision-language Models (2025) 0.00
- Understanding The Effect Of Using Semantically Meaningful Tokens For Visual Representation Learning (2024) 0.00
- Prompting Large Vision-language Models For Compositional Reasoning (2024) 0.00
- Learning The Visualness Of Text Using Large Vision-language Models (2023) 4.52
- Linear Spaces Of Meanings: Compositional Structures In Vision-language Models (2023) 9.41