FG-CLIP: Fine-grained Visual And Textual Alignment
2025 · Chunyu Xie, Bin Wang, Fanjing Kong, et al.
Abstract
Contrastive Language-Image Pre-training (CLIP) excels in multimodal tasks such as image-text retrieval and zero-shot classification but struggles with fine-grained understanding due to its focus on coarse-grained short captions. To address this, we propose Fine-Grained CLIP (FG-CLIP), which enhances fine-grained understanding through three key innovations. First, we leverage large multimodal models to generate 1.6 billion long caption-image pairs for capturing global-level semantic details. Second, a high-quality dataset is constructed with 12 million images and 40 million region-specific bounding boxes aligned with detailed captions to ensure precise, context-rich representations. Third, 10 million hard fine-grained negative samples are incorporated to improve the model's ability to distinguish subtle semantic differences. We construct a comprehensive dataset, termed FineHARD, by integrating high-quality region-specific annotations with hard fine-grained negative samples. Correspondin
Related papers
- FineLIP: Extending CLIP's Reach Via Fine-grained Alignment With Longer Text Inputs (2025) 6.34
- β-CLIP: Text-conditioned Contrastive Learning For Multi-granular Vision-language Alignment (2025) 2.16
- DGTRSD & DGTRS-CLIP: A Dual-granularity Remote Sensing Image-text Dataset And Vision Language Foundation Model For Alignment (2025) 2.98
- ContextCLIP: Contextual Alignment Of Image-text Pairs On CLIP Visual Representations (2022) 5.84
- SuperCLIP: CLIP With Simple Classification Supervision (2025) 0.00
- Finetuning CLIP To Reason About Pairwise Differences (2024) 0.00
- FLAIR: VLM With Fine-grained Language-informed Image Representations (2024) 10.14
- CLIP Is Shortsighted: Paying Attention Beyond The First Sentence (2026) 0.00