\(\beta\)-CLIP: Text-conditioned Contrastive Learning For Multi-granular Vision-language Alignment
2025 · Fatimah Zohra, Chen Zhao, Hani Itani, et al.
Abstract
CLIP achieves strong zero-shot image-text retrieval by aligning global vision and text representations, yet it falls behind on fine-grained tasks even when fine-tuned on long, detailed captions. In this work, we propose \(\beta\)-CLIP, a multi-granular text-conditioned contrastive learning framework designed to achieve hierarchical alignment between multiple textual granularities, from full captions to sentences and phrases, and their corresponding visual regions. For each level of granularity, \(\beta\)-CLIP uses cross-attention to dynamically pool image patches, producing contextualized visual embeddings. To address the semantic overlap inherent in this hierarchy, we introduce the \(\beta\)-Contextualized Contrastive Alignment Loss (\(\beta\)-CAL). This objective parameterizes the trade-off between strict query-specific matching and relaxed intra-image contextualization, supporting both soft Cross-Entropy and hard Binary Cross-Entropy formulations. We find that each loss interacts…
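The text-conditioned pooling step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes single-head scaled dot-product attention with no learned projections, and the names `text_conditioned_pool`, `patches`, `caption_query`, and `phrase_query` are hypothetical. It only shows how the same set of image patches yields a different pooled visual embedding for each text query.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def text_conditioned_pool(text_query, patches):
    """Pool patch embeddings using attention weights from one text query.

    Assumption: the text embedding acts directly as the attention query and
    the patch embeddings serve as both keys and values (no learned
    projections, single head). A real model would add those.
    """
    scale = math.sqrt(len(text_query))
    weights = softmax([dot(text_query, p) / scale for p in patches])
    dim = len(patches[0])
    pooled = [sum(w * p[d] for w, p in zip(weights, patches)) for d in range(dim)]
    return pooled, weights

# Toy 2-D embeddings: three image patches and two text queries.
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
caption_query = [1.0, 1.0]   # stands in for a full-caption embedding
phrase_query = [4.0, 0.0]    # stands in for a phrase focused on the first feature

caption_vec, caption_w = text_conditioned_pool(caption_query, patches)
phrase_vec, phrase_w = text_conditioned_pool(phrase_query, patches)
```

In this toy setup the caption query attends most strongly to the third patch, which matches both features, while the phrase query shifts attention away from the second patch, so the two pooled embeddings differ even though the image is the same.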
Related papers
- ContextCLIP: Contextual Alignment Of Image-text Pairs On CLIP Visual Representations (2022) · 5.84
- FG-CLIP: Fine-grained Visual And Textual Alignment (2025) · 5.75
- SuperCLIP: CLIP With Simple Classification Supervision (2025) · 0.00
- CIBR: Cross-modal Information Bottleneck Regularization For Robust CLIP Generalization (2025) · 4.52
- Advancing Myopia To Holism: Fully Contrastive Language-image Pre-training (2024) · 0.00
- LightCLIP: Learning Multi-level Interaction For Lightweight Vision-language Models (2023) · 0.00
- Optimizing CLIP Models For Image Retrieval With Maintained Joint-embedding Alignment (2024) · 6.34
- X-CLIP: End-to-end Multi-grained Contrastive Learning For Video-text Retrieval (2022) · 18.12