Abstract

As a pioneering vision-language model, CLIP (Contrastive Language-Image Pre-training) has achieved significant success across various domains and a wide range of downstream vision-language tasks. However, the text encoders in popular CLIP models are limited to processing only 77 text tokens, which constrains their ability to effectively handle longer, detail-rich captions. Additionally, CLIP models often struggle to effectively capture detailed visual and textual information, which hampers their performance on tasks that require fine-grained analysis. To address these limitations, we present a novel approach, FineLIP, that extends the capabilities of CLIP. FineLIP enhances cross-modal text-image mapping by incorporating Fine-grained alignment with Longer text input within the CLIP-style framework. FineLIP first extends the positional embeddings to handle longer text, followed by the dynamic aggregation of local image and text tokens. The aggr
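
The abstract says the positional embeddings are extended to handle longer text but does not say how in the portion shown here. As a rough illustration only, the sketch below stretches a learned positional-embedding table by plain linear interpolation. This is an assumed stand-in, not the authors' procedure; the function name `stretch_positional_embeddings`, the toy table, and the target length of 153 are all hypothetical.

```python
def stretch_positional_embeddings(pos_emb, new_len):
    """Linearly interpolate a [old_len][dim] table to [new_len][dim].

    Assumption: plain linear interpolation, chosen for illustration.
    It is not taken from the paper.
    """
    old_len = len(pos_emb)
    if old_len < 2 or new_len < 2:
        raise ValueError("need at least two positions on both sides")
    # Map new index i onto a fractional position in the old table,
    # keeping the first and last positions fixed.
    scale = (old_len - 1) / (new_len - 1)
    stretched = []
    for i in range(new_len):
        src = i * scale
        lo = min(int(src), old_len - 2)  # index of the left neighbour
        frac = src - lo                  # weight given to the right neighbour
        row = [(1 - frac) * a + frac * b
               for a, b in zip(pos_emb[lo], pos_emb[lo + 1])]
        stretched.append(row)
    return stretched


# Toy table: 77 positions (the CLIP limit named in the abstract), 4 dims.
old = [[float(p), 2.0 * p, -float(p), 1.0] for p in range(77)]
# 153 is an arbitrary illustrative target length, not a value from the paper.
new = stretch_positional_embeddings(old, 153)
```

Interpolation keeps the endpoints of the original table and fills in intermediate positions, so a pretrained text encoder sees familiar embeddings at the boundaries. Whether every position should be stretched uniformly is a design choice this sketch does not address.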

Authors

(none)

Tags

  • Uncategorized

Stats

  • citations: 6
  • S2 citations: -
  • github stars: 0
  • HF likes: 0
  • heat score: 6.34
  • arxiv key: asokan2025finelip

Related papers