Perception Encoder: The Best Visual Embeddings Are Not At The Output Of The Network
2025 · Daniel Bolya, Po-Yao Huang, Peize Sun, et al.
Abstract
We introduce Perception Encoder (PE), a state-of-the-art vision encoder for image and video understanding trained via simple vision-language learning. Traditionally, vision encoders have relied on a variety of pretraining objectives, each tailored to specific downstream tasks such as classification, captioning, or localization. Surprisingly, after scaling our carefully tuned image pretraining recipe and refining with our robust video data engine, we find that contrastive vision-language training alone can produce strong, general embeddings for all of these downstream tasks. There is only one caveat: these embeddings are hidden within the intermediate layers of the network. To draw them out, we introduce two alignment methods: language alignment for multimodal language modeling, and spatial alignment for dense prediction. Together, our PE family of models achieves best-in-class results on a wide variety of tasks, including (1) zero-shot image and video classification and retrieval, simu
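The two ideas the abstract leans on, contrastive vision-language training and reading embeddings from intermediate layers, can be illustrated with a minimal sketch. This is not the PE recipe: it uses a generic CLIP-style symmetric contrastive loss, and `toy_encoder`, the `tanh` layers, the layer sizes, and the temperature of 0.07 are all hypothetical choices made only for illustration.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    # Generic symmetric contrastive loss: row i of image_emb is assumed to be
    # paired with row i of text_emb. The temperature is an assumed value.
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    logits = img @ txt.T / temperature
    idx = np.arange(logits.shape[0])
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> image
    return 0.5 * (loss_i2t + loss_t2i)

def toy_encoder(x, weights):
    # Hypothetical stand-in for a vision encoder: a stack of tanh layers that
    # returns the hidden state after every layer, not only the last one.
    hidden_states = []
    h = x
    for w in weights:
        h = np.tanh(h @ w)
        hidden_states.append(h)
    return hidden_states

rng = np.random.default_rng(0)
weights = [rng.normal(scale=0.5, size=(8, 8)) for _ in range(4)]
images = rng.normal(size=(4, 8))

hidden_states = toy_encoder(images, weights)
output_emb = hidden_states[-1]        # what the contrastive loss is trained on
intermediate_emb = hidden_states[1]   # an earlier layer, read out for other tasks

aligned_loss = contrastive_loss(output_emb, output_emb)
shuffled_loss = contrastive_loss(output_emb, output_emb[::-1])
```

In this sketch the loss only ever sees the final layer, while `hidden_states` keeps every earlier layer available, which is the kind of access an alignment method would need to draw features out of the middle of the network.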
Related papers
- Give: Guiding Visual Encoder To Perceive Overlooked Information (2024) 0.00
- Come-vl: Scaling Complementary Multi-encoder Vision-language Learning (2026) 0.00
- Beginning With You: Perceptual-initialization Improves Vision-language Representation And Alignment (2025) 0.00
- Blind To Position, Biased In Language: Probing Mid-layer Representational Bias In Vision-language Encoders For Zero-shot Language-grounded Spatial Understanding (2025) 0.00
- Efficient Discriminative Joint Encoders For Large Scale Vision-language Reranking (2025) 0.00
- Finevit: Progressively Unlocking Fine-grained Perception With Dense Recaptions (2026) 0.00
- Unicoder-VL: A Universal Encoder For Vision And Language By Cross-modal Pre-training (2019) 20.24
- Human-aligned Image Models Improve Visual Decoding From The Brain (2025) 0.00