GOAL: Global-local Object Alignment Learning
2025 · Hyungyu Choi, Young Kyun Jang, Chanho Eom
Abstract
Vision-language models like CLIP have shown impressive capabilities in aligning images and text, but they often struggle with lengthy and detailed text descriptions because of their training focus on short and concise captions. We present GOAL (Global-local Object Alignment Learning), a novel fine-tuning method that enhances CLIP's ability to handle lengthy text by leveraging both global and local semantic alignments between image and lengthy text. Our approach consists of two key components: Local Image-Sentence Matching (LISM), which identifies corresponding pairs between image segments and descriptive sentences, and Token Similarity-based Learning (TSL), which efficiently propagates local element attention through these matched pairs. Evaluating GOAL on three new benchmarks for image-lengthy text retrieval, we demonstrate significant improvements over baseline CLIP fine-tuning, establishing a simple yet effective approach for adapting CLIP to detailed textual descriptions. Through e
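As a rough illustration of the matching idea behind LISM, the sketch below pairs each descriptive sentence with its most similar image segment. It assumes the segments and sentences have already been embedded into a shared vector space (for example by CLIP's image and text encoders) and uses plain cosine similarity with a best-match rule. The function names `cosine` and `match_segments_to_sentences` and the toy vectors are hypothetical, and the paper's actual matching procedure may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def match_segments_to_sentences(segment_embs, sentence_embs):
    """Pair every sentence with its most similar image segment.

    Returns a list of (segment_index, sentence_index, similarity) tuples.
    This best-match rule is an assumption made for illustration only.
    """
    if not segment_embs:
        return []
    pairs = []
    for s_idx, sent in enumerate(sentence_embs):
        scores = [cosine(seg, sent) for seg in segment_embs]
        best = max(range(len(scores)), key=scores.__getitem__)
        pairs.append((best, s_idx, scores[best]))
    return pairs


# Toy embeddings standing in for encoder outputs (made up for this example).
segment_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
sentence_embs = [[0.1, 0.9, 0.0], [0.8, 0.2, 0.0]]

pairs = match_segments_to_sentences(segment_embs, sentence_embs)
for seg_idx, sent_idx, score in pairs:
    print(f"sentence {sent_idx} -> segment {seg_idx} (cosine {score:.3f})")
```

In a fine-tuning setup, matched pairs like these would serve as local positives alongside the global image-text pair. How the local signal is propagated to tokens (the TSL component) is not covered by this sketch.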
Related papers
- FineLIP: Extending CLIP's Reach Via Fine-grained Alignment With Longer Text Inputs (2025) 6.34
- FG-CLIP: Fine-grained Visual And Textual Alignment (2025) 5.75
- Linear Alignment Of Vision-language Models For Image Captioning (2023) 0.00
- Text-video Retrieval With Global-local Semantic Consistent Learning (2024) 8.75
- CLIP-Lite: Information Efficient Visual Representation Learning With Language Supervision (2021) 2.35
- DGTRSD & DGTRS-CLIP: A Dual-granularity Remote Sensing Image-text Dataset And Vision Language Foundation Model For Alignment (2025) 2.98
- LLM2CLIP: Powerful Language Model Unlocks Richer Cross-modality Representation (2024) 2.26
- Long-CLIP: Unlocking The Long-text Capability Of CLIP (2024) 14.90