CV, AI, CL, LG · Jul 10, 2024

PaliGemma: A versatile 3B VLM for transfer

Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, Thomas Unterthiner, Daniel Keysers

DeepMind, Oxford

arXiv:2407.07726v2 · 52.0 · 700 citations · h-index: 52 · Has Code

Originality: Synthesis-oriented

AI Analysis

PaliGemma gives researchers and practitioners an open, broadly applicable VLM, though the contribution appears incremental because it builds on existing components.

The authors introduce PaliGemma, a 3B-parameter vision-language model built on SigLIP-So400m and Gemma-2B. It is trained as a versatile base model for transfer and achieves strong performance on nearly 40 diverse tasks, spanning standard benchmarks and more specialized tasks such as remote sensing and segmentation.

PaliGemma is an open Vision-Language Model (VLM) that is based on the SigLIP-So400m vision encoder and the Gemma-2B language model. It is trained to be a versatile and broadly knowledgeable base model that is effective to transfer. It achieves strong performance on a wide variety of open-world tasks. We evaluate PaliGemma on almost 40 diverse tasks including standard VLM benchmarks, but also more specialized tasks such as remote-sensing and segmentation.
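To make the encoder-plus-language-model composition concrete, here is a minimal toy sketch of how image tokens and text tokens might be joined into a single input sequence. It is not the PaliGemma implementation: the function names, the tiny dimensions, the random stand-in encoder, the linear projection step, and the image-then-text ordering are all assumptions chosen for illustration.

```python
import random

def vision_encoder(image, num_tokens=4, dim=3):
    """Hypothetical stand-in for a vision encoder such as SigLIP-So400m.

    Returns `num_tokens` pseudo-random vectors of width `dim`, seeded by the
    pixel sum so the output is deterministic for a given image.
    """
    rng = random.Random(sum(sum(row) for row in image))
    return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(num_tokens)]

def project(tokens, weight):
    """Linear map from vision width to language-model width (assumed step)."""
    out_dim = len(weight[0])
    return [
        [sum(tok[i] * weight[i][j] for i in range(len(tok))) for j in range(out_dim)]
        for tok in tokens
    ]

def embed_text(token_ids, table):
    """Look up one embedding vector per text token id."""
    return [table[i] for i in token_ids]

def build_input_sequence(image, token_ids, weight, table):
    """Concatenate projected image tokens with text embeddings.

    The resulting sequence is what a language model such as Gemma-2B
    would consume in this toy setup.
    """
    image_tokens = project(vision_encoder(image), weight)
    text_tokens = embed_text(token_ids, table)
    return image_tokens + text_tokens

# Toy usage: a 2x2 "image", three text token ids, LM width 2.
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3 (vision) x 2 (LM)
table = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]    # 3-entry vocabulary
sequence = build_input_sequence([[0, 1], [2, 3]], [2, 0, 1], weight, table)
print(len(sequence))  # 4 image tokens followed by 3 text tokens
```

The point of the sketch is only the data flow: both modalities end up as vectors of the same width, so the language model can treat them as one sequence.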

