VL-JEPA: Joint Embedding Predictive Architecture For Vision-language
2025 · Delong Chen, Mustafa Shukor, Theo Moutakanni, et al.
Abstract
We introduce VL-JEPA, a vision-language model built on a Joint Embedding Predictive Architecture (JEPA). Instead of autoregressively generating tokens as in classical VLMs, VL-JEPA predicts continuous embeddings of the target texts. By learning in an abstract representation space, the model focuses on task-relevant semantics while abstracting away surface-level linguistic variability. In a strictly controlled comparison against standard token-space VLM training with the same vision encoder and training data, VL-JEPA achieves stronger performance while having 50% fewer trainable parameters. At inference time, a lightweight text decoder is invoked only when needed to translate VL-JEPA's predicted embeddings into text. We show that VL-JEPA natively supports selective decoding that reduces the number of decoding operations by 2.85x while maintaining similar performance compared to non-adaptive uniform decoding. Beyond generation, VL-JEPA's embedding space naturally supports open-vocabula
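The selective decoding idea in the abstract (invoke the text decoder only when needed) can be illustrated with a minimal sketch. The abstract does not say how VL-JEPA decides when to decode, so the rule used here is an assumption swapped in for illustration: decode only when the predicted embedding drifts away from the last decoded one, as measured by cosine similarity. The names `selective_decode`, `cosine_similarity`, `decode`, and `threshold` are hypothetical and do not come from the paper.

```python
import math
from typing import Callable, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def selective_decode(
    embeddings: Sequence[Sequence[float]],
    decode: Callable[[Sequence[float]], str],
    threshold: float = 0.9,
) -> List[Tuple[int, str]]:
    """Decode a stream of predicted embeddings only when the semantics shift.

    ASSUMPTION: a similarity threshold against the last decoded embedding
    stands in for whatever criterion the paper actually uses.
    """
    outputs: List[Tuple[int, str]] = []
    last_decoded = None
    for step, emb in enumerate(embeddings):
        # Always decode the first step; afterwards decode only on drift.
        if last_decoded is None or cosine_similarity(emb, last_decoded) < threshold:
            outputs.append((step, decode(emb)))
            last_decoded = emb
    return outputs


if __name__ == "__main__":
    # Toy 2-D "embeddings": steps 0-1 are near-identical, as are steps 2-3.
    stream = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.02, 1.0]]
    # Stand-in decoder (hypothetical): a real one would map embeddings to text.
    results = selective_decode(stream, decode=lambda e: f"caption for {e}")
    for step, text in results:
        print(step, text)
```

In this toy stream the decoder runs at two of the four steps, because the embeddings at the other two steps stay close to the last decoded one. Uniform decoding would call the decoder at every step instead.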
Related papers
- VLMAE: Vision-language Masked Autoencoder (2022) 0.00
- Efficient Discriminative Joint Encoders For Large Scale Vision-language Reranking (2025) 0.00
- LiteVL: Efficient Video-language Learning With Enhanced Spatial-temporal Modeling (2022) 6.34
- Come-vl: Scaling Complementary Multi-encoder Vision-language Learning (2026) 0.00
- VLM2Vec: Training Vision-language Models For Massive Multimodal Embedding Tasks (2024) 0.00
- Context-adaptive Multi-prompt Embedding With Large Language Models For Vision-language Alignment (2025) 0.00
- ProbVLM: Probabilistic Adapter For Frozen Vision-language Models (2023) 13.41
- Leveraging Data To Say No: Memory Augmented Plug-and-play Selective Prediction (2026) 0.78