Beginning With You: Perceptual-initialization Improves Vision-language Representation And Alignment
2025 · Yang Hu, Runchen Wang, Stephen Chong Zhao, et al.
Abstract
We introduce Perceptual-Initialization (PI), a paradigm shift in visual representation learning that incorporates human perceptual structure during the initialization phase rather than as a downstream fine-tuning step. By integrating human-derived triplet embeddings from the NIGHTS dataset to initialize a CLIP vision encoder, followed by self-supervised learning on YFCC15M, our approach demonstrates significant zero-shot performance improvements, without any task-specific fine-tuning, across 29 zero-shot classification and 2 retrieval benchmarks. On ImageNet-1K, zero-shot gains emerge after approximately 15 epochs of pretraining. Benefits are observed across datasets of various scales, with improvements manifesting at different stages of the pretraining process depending on dataset characteristics. Our approach consistently enhances zero-shot top-1 accuracy, top-5 accuracy, and retrieval recall (e.g., R@1, R@5) across these diverse evaluation tasks, without requiring any adaptation to
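The metrics named in the abstract (top-1 and top-5 accuracy, R@1 and R@5) share one definition: the fraction of queries whose correct item appears among the k highest-scoring candidates. The following is a minimal sketch of that definition in plain Python, not the authors' evaluation code. The function names and the toy similarity matrix are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def top_k_indices(row: Sequence[float], k: int) -> list[int]:
    """Return the indices of the k largest scores in a row, highest first."""
    order = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
    return order[:k]


def hit_rate_at_k(
    scores: Sequence[Sequence[float]], targets: Sequence[int], k: int
) -> float:
    """Fraction of rows whose target index is among the k top-scoring columns.

    With rows = images and columns = class prompts this is top-k accuracy;
    with rows = queries and columns = gallery items it is Recall@k.
    """
    hits = sum(1 for row, t in zip(scores, targets) if t in top_k_indices(row, k))
    return hits / len(targets)


# Toy similarity matrix (made-up values): 3 images (rows) x 3 class prompts (columns).
sims = [
    [0.9, 0.1, 0.3],  # correct class 0 is ranked first
    [0.2, 0.4, 0.7],  # correct class 1 is ranked second
    [0.5, 0.6, 0.1],  # correct class 2 is ranked last
]
labels = [0, 1, 2]

top1 = hit_rate_at_k(sims, labels, 1)
top2 = hit_rate_at_k(sims, labels, 2)
```

In this toy case only the first image is correct at k = 1, and the first two are correct at k = 2. In a zero-shot setting the score rows would typically come from similarities between image embeddings and text embeddings of class prompts or captions.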
Related papers
- Perception Encoder: The Best Visual Embeddings Are Not At The Output Of The Network (2025)
- PriorCLIP: Visual Prior Guided Vision-language Model For Remote Sensing Image-text Retrieval (2024)
- Blind To Position, Biased In Language: Probing Mid-layer Representational Bias In Vision-language Encoders For Zero-shot Language-grounded Spatial Understanding (2025)
- Elevating All Zero-shot Sketch-based Image Retrieval Through Multimodal Prompt Learning (2024)
- Pretrain Like Your Inference: Masked Tuning Improves Zero-shot Composed Image Retrieval (2023)
- LightCLIP: Learning Multi-level Interaction For Lightweight Vision-language Models (2023)
- Incremental Embedding Learning Via Zero-shot Translation (2020)
- Visual Space Optimization For Zero-shot Learning (2019)