PaLI-3 Vision Language Models: Smaller, Faster, Stronger
2023 · Xi Chen, Xiao Wang, Lucas Beyer, et al.
Abstract
This paper presents PaLI-3, a smaller, faster, and stronger vision language model (VLM) that compares favorably to similar models that are 10x larger. As part of arriving at this strong performance, we compare Vision Transformer (ViT) models pretrained using classification objectives to contrastively (SigLIP) pretrained ones. We find that, while slightly underperforming on standard image classification benchmarks, SigLIP-based PaLI shows superior performance across various multimodal benchmarks, especially on localization and visually-situated text understanding. We scale the SigLIP image encoder up to 2 billion parameters and achieve a new state of the art on multilingual cross-modal retrieval. We hope that PaLI-3, at only 5B parameters, rekindles research on fundamental pieces of complex VLMs, and could fuel a new generation of scaled-up models.
Related papers
- ELIP: Enhanced Visual-language Foundation Models For Image Retrieval (2025)
- Med3DVLM: An Efficient Vision-language Model For 3D Medical Image Analysis (2025)
- LiteVL: Efficient Video-language Learning With Enhanced Spatial-temporal Modeling (2022)
- LLM2CLIP: Powerful Language Model Unlocks Richer Cross-modality Representation (2024)
- ViLBERT: Pretraining Task-agnostic Visiolinguistic Representations For Vision-and-language Tasks (2019)
- ColPali: Efficient Document Retrieval With Vision Language Models (2024)
- LightCLIP: Learning Multi-level Interaction For Lightweight Vision-language Models (2023)
- Babel-ImageNet: Massively Multilingual Evaluation Of Vision-and-language Representations (2023)