FineLIP: Extending CLIP's Reach Via Fine-grained Alignment With Longer Text Inputs
2025 · Mothilal Asokan, Kebin Wu, Fatima Albreiki
Abstract
As a pioneering vision-language model, CLIP (Contrastive Language-Image Pre-training) has achieved significant success across various domains and a wide range of downstream vision-language tasks. However, the text encoders in popular CLIP models are limited to processing only 77 text tokens, which constrains their ability to effectively handle longer, detail-rich captions. Additionally, CLIP models often struggle to effectively capture detailed visual and textual information, which hampers their performance on tasks that require fine-grained analysis. To address these limitations, we present a novel approach, FineLIP, that extends the capabilities of CLIP. FineLIP enhances cross-modal text-image mapping by incorporating Fine-grained alignment with Longer text input within the CLIP-style framework. FineLIP first extends the positional embeddings to handle longer text, followed by the dynamic aggregation of local image and text tokens. The aggr
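The first step named in the abstract, extending the positional embeddings beyond 77 tokens, can be illustrated with a minimal sketch. It assumes plain linear interpolation of a learned position table, which is one common way to stretch such a table and is not confirmed by the abstract as the authors' method. The function name `stretch_positional_embeddings`, the toy 4-dimensional table, and the target length of 248 are all hypothetical choices for illustration.

```python
def stretch_positional_embeddings(pos_emb, new_len):
    """Linearly interpolate a [old_len][dim] position table to [new_len][dim].

    Assumption: plain linear interpolation; the paper's actual scheme may differ.
    """
    old_len = len(pos_emb)
    if old_len < 2 or new_len < 2:
        raise ValueError("need at least two positions on each side")
    stretched = []
    for i in range(new_len):
        # Map target index i onto the source axis [0, old_len - 1].
        src = i * (old_len - 1) / (new_len - 1)
        lo = int(src)
        hi = min(lo + 1, old_len - 1)
        w = src - lo  # fractional distance from the lower neighbour
        stretched.append(
            [(1 - w) * a + w * b for a, b in zip(pos_emb[lo], pos_emb[hi])]
        )
    return stretched


# Toy table: 77 positions (CLIP's text limit) with 4 dimensions for readability.
old_table = [[float(p), 2.0 * p, 1.0, -float(p)] for p in range(77)]

# 248 is an illustrative target length, not a value taken from the abstract.
new_table = stretch_positional_embeddings(old_table, 248)
```

The stretched table keeps the first and last original positions unchanged and fills the positions in between by blending neighbouring rows. In practice the table would be a tensor inside the text encoder and would be fine-tuned after stretching.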
Related papers
- Long-CLIP: Unlocking The Long-text Capability Of CLIP (2024)
- FG-CLIP: Fine-grained Visual And Textual Alignment (2025)
- LLM2CLIP: Powerful Language Model Unlocks Richer Cross-modality Representation (2024)
- TULIP: Token-length Upgraded CLIP (2024)
- Finetuning CLIP To Reason About Pairwise Differences (2024)
- LoTLIP: Improving Language-image Pre-training For Long Text Understanding (2024)
- FLAIR: VLM With Fine-grained Language-informed Image Representations (2024)
- GOAL: Global-local Object Alignment Learning (2025)