CV | AI | CL | LG | Jul 10, 2024

PaliGemma: A versatile 3B VLM for transfer

DeepMind, Oxford
arXiv:2407.07726v2 | 660 citations | h-index: 52
Originality: Synthesis-oriented
AI Analysis

PaliGemma gives researchers and practitioners an open, broadly applicable VLM, though the work appears incremental because it builds on existing components.

The authors introduce PaliGemma, a 3B-parameter vision-language model built on the SigLIP-So400m vision encoder and the Gemma-2B language model. It is trained as a versatile base model for transfer and achieves strong performance on nearly 40 diverse tasks, including standard VLM benchmarks and more specialized tasks such as remote sensing and segmentation.

PaliGemma is an open Vision-Language Model (VLM) that is based on the SigLIP-So400m vision encoder and the Gemma-2B language model. It is trained to be a versatile and broadly knowledgeable base model that is effective to transfer. It achieves strong performance on a wide variety of open-world tasks. We evaluate PaliGemma on almost 40 diverse tasks including standard VLM benchmarks, but also more specialized tasks such as remote-sensing and segmentation.

Code Implementations: 1 repo