\(\beta\)-CLIP: Text-Conditioned Contrastive Learning for Multi-Granular Vision-Language Alignment

Abstract

CLIP achieves strong zero-shot image-text retrieval by aligning global vision and text representations, yet it falls behind on fine-grained tasks even when fine-tuned on long, detailed captions. In this work, we propose \(\beta\)-CLIP, a multi-granular text-conditioned contrastive learning framework designed to achieve hierarchical alignment between multiple textual granularities (from full captions to sentences and phrases) and their corresponding visual regions. For each level of granularity, \(\beta\)-CLIP utilizes cross-attention to dynamically pool image patches, producing contextualized visual embeddings. To address the semantic overlap inherent in this hierarchy, we introduce the \(\beta\)-Contextualized Contrastive Alignment Loss (\(\beta\)-CAL). This objective parameterizes the trade-off between strict query-specific matching and relaxed intra-image contextualization, supporting both soft Cross-Entropy and hard Binary Cross-Entropy formulations. We find that each loss interacts
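The text-conditioned pooling step mentioned in the abstract can be illustrated with a minimal sketch. It assumes a single attention head with no learned projections, and the function name `text_conditioned_pool` and the array shapes are hypothetical, chosen only for illustration; the abstract does not specify the actual \(\beta\)-CLIP architecture, and the \(\beta\)-CAL loss is not sketched here because its form is not given.

```python
import numpy as np

def text_conditioned_pool(text_queries, patch_embeddings):
    """Pool image patches once per text query with scaled dot-product attention.

    text_queries:     (num_queries, dim), one row per caption, sentence, or phrase
    patch_embeddings: (num_patches, dim), one row per image patch
    returns:          pooled (num_queries, dim), weights (num_queries, num_patches)

    Assumption: plain single-head attention without learned projections.
    """
    dim = text_queries.shape[-1]
    # Similarity of every text query to every patch, scaled by sqrt(dim).
    scores = text_queries @ patch_embeddings.T / np.sqrt(dim)
    # Numerically stable softmax over the patch axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Each query receives its own weighted average of the patches.
    pooled = weights @ patch_embeddings
    return pooled, weights

# Toy inputs: 3 text queries (e.g. caption, sentence, phrase) and 16 patches.
rng = np.random.default_rng(0)
queries = rng.normal(size=(3, 8))
patches = rng.normal(size=(16, 8))
pooled, weights = text_conditioned_pool(queries, patches)
print(pooled.shape, weights.shape)
```

Because the attention weights depend on the text query, the same image yields a different visual embedding for each caption, sentence, or phrase, which is what makes the pooled embeddings contextualized.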
