CV · Mar 1, 2024

VisionLLaMA: A Unified LLaMA Backbone for Vision Tasks

arXiv:2403.00522v2 · 36 citations · h-index: 29 · Has Code · ECCV
AI Analysis

VisionLLaMA provides a unified and generic modeling framework for most vision tasks and could serve as a strong new baseline for vision generation and understanding.

The paper introduces VisionLLaMA, a LLaMA-like transformer backbone adapted for processing 2D images, available in plain and pyramid forms. In many cases it shows substantial gains over previous state-of-the-art vision transformers on image perception and image generation tasks.

Large language models are built on top of a transformer-based architecture to process textual inputs. For example, LLaMA stands out among many open-source implementations. Can the same transformer be used to process 2D images? In this paper, we answer this question by unveiling a LLaMA-like vision transformer in plain and pyramid forms, termed VisionLLaMA, which is tailored for this purpose. VisionLLaMA is a unified and generic modelling framework for solving most vision tasks. We extensively evaluate its effectiveness using typical pre-training paradigms on a good portion of downstream tasks in image perception and especially image generation. In many cases, VisionLLaMA has exhibited substantial gains over the previous state-of-the-art vision transformers. We believe that VisionLLaMA can serve as a strong new baseline model for vision generation and understanding. Our code is released at https://github.com/Meituan-AutoML/VisionLLaMA.

Code Implementations: 1 repo