FineLIP: Extending CLIP's Reach via Fine-Grained Alignment with Longer Text Inputs

Abstract

As a pioneering vision-language model, CLIP (Contrastive Language-Image Pre-training) has achieved significant success across various domains and a wide range of downstream vision-language tasks. However, the text encoders in popular CLIP models can process only 77 text tokens, which limits their ability to handle longer, detail-rich captions. In addition, CLIP models often struggle to capture detailed visual and textual information, which hampers their performance on tasks that require fine-grained analysis. To address these limitations, we present a novel approach, FineLIP, that extends the capabilities of CLIP. FineLIP enhances cross-modal text-image mapping by incorporating Fine-grained alignment with Longer text input within the CLIP-style framework. FineLIP first extends the positional embeddings to handle longer text, then dynamically aggregates local image and text tokens. The aggr
