GoldiCLIP: The Goldilocks Approach For Balancing Explicit Supervision For Language-image Pretraining
2026 · Deen Dayal Mohan, Hossein Souri, Vitali Petsiuk, et al.
Abstract
Until recently, the success of large-scale vision-language models (VLMs) has primarily relied on billion-sample datasets, posing a significant barrier to progress. Recent work has begun to close this gap by improving supervision quality, but each approach addresses only a subset of the weaknesses in contrastive pretraining. We present GoldiCLIP, a framework built on a Goldilocks principle of finding the right balance of supervision signals. Our multifaceted training framework combines three key innovations: (1) a text-conditioned self-distillation method to align both text-agnostic and text-conditioned features; (2) an encoder-integrated decoder with a Visual Question Answering (VQA) objective that enables the encoder to generalize beyond caption-like queries; and (3) an uncertainty-based weighting mechanism that automatically balances all heterogeneous losses. Trained on just 30 million images, 300x less data than leading methods, GoldiCLIP achieves state-of-the-art among d
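The abstract does not say how the uncertainty-based weighting in (3) is formulated. As a rough illustration only, the sketch below uses one common formulation from multi-task learning, in which each loss L_i is scaled by exp(-s_i) and regularized by s_i, where s_i is a learned log-variance. The function names, the three example objectives, and the loss values are hypothetical and are not taken from the paper.

```python
import math

def uncertainty_weighted_loss(losses, log_vars):
    """Combine per-objective losses using per-objective log-variances.

    Each term is exp(-s_i) * L_i + s_i, with s_i = log(sigma_i^2).
    A larger s_i down-weights its loss; the additive s_i term
    penalizes growing s_i without bound.
    """
    if len(losses) != len(log_vars):
        raise ValueError("need one log-variance per loss")
    return sum(math.exp(-s) * loss + s for loss, s in zip(losses, log_vars))

def optimal_log_var(loss):
    """Minimizer of exp(-s) * loss + s for a fixed positive loss.

    Setting the derivative -exp(-s) * loss + 1 to zero gives s = log(loss).
    """
    return math.log(loss)

# Hypothetical loss values for three objectives
# (assumed here: contrastive, self-distillation, VQA).
losses = [2.0, 0.5, 1.0]

# With every s_i = 0 the combination reduces to a plain sum of the losses.
equal = uncertainty_weighted_loss(losses, [0.0, 0.0, 0.0])

# With each s_i at its closed-form minimizer the combined value is lower.
tuned = uncertainty_weighted_loss(losses, [optimal_log_var(l) for l in losses])
```

In a real training loop the log-variances would typically be learnable parameters updated by gradient descent alongside the model weights, rather than set in closed form as in this toy example.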
Related papers
- SuperCLIP: CLIP With Simple Classification Supervision (2025) 0.00
- CLIP-ViP: Adapting Pre-trained Image-text Model To Video-language Representation Alignment (2022) 5.42
- Advancing Myopia To Holism: Fully Contrastive Language-image Pre-training (2024) 0.00
- CLIP-Lite: Information Efficient Visual Representation Learning With Language Supervision (2021) 2.35
- Modeling Caption Diversity In Contrastive Vision-language Pretraining (2024) 0.00
- Robust Cross-modal Representation Learning With Progressive Self-distillation (2022) 12.33
- LightCLIP: Learning Multi-level Interaction For Lightweight Vision-language Models (2023) 0.00
- LLM2CLIP: Powerful Language Model Unlocks Richer Cross-modality Representation (2024) 2.26