CV | Nov 18, 2024

LaVin-DiT: Large Vision Diffusion Transformer

arXiv:2411.11505v4 | 28 citations | h-index: 22 | CVPR
Originality Highly original
AI Analysis

This work addresses two problems in existing large vision models, inefficient autoregressive generation and disrupted spatial relationships, by offering a novel generative framework for computer vision researchers and practitioners.

The paper tackles the challenge of building a scalable foundation model for over 20 computer vision tasks by introducing LaVin-DiT, which combines a spatial-temporal variational autoencoder, a joint diffusion transformer, and in-context learning to achieve state-of-the-art performance across diverse tasks without fine-tuning.

This paper presents the Large Vision Diffusion Transformer (LaVin-DiT), a scalable and unified foundation model designed to tackle over 20 computer vision tasks in a generative framework. Unlike existing large vision models directly adapted from natural language processing architectures, which rely on less efficient autoregressive techniques and disrupt spatial relationships essential for vision data, LaVin-DiT introduces key innovations to optimize generative performance for vision tasks. First, to address the high dimensionality of visual data, we incorporate a spatial-temporal variational autoencoder that encodes data into a continuous latent space. Second, for generative modeling, we develop a joint diffusion transformer that progressively produces vision outputs. Third, for unified multi-task training, in-context learning is implemented. Input-target pairs serve as task context, which guides the diffusion transformer to align outputs with specific tasks within the latent space. During inference, a task-specific context set and test data as queries allow LaVin-DiT to generalize across tasks without fine-tuning. Trained on extensive vision datasets, the model is scaled from 0.1B to 3.4B parameters, demonstrating substantial scalability and state-of-the-art performance across diverse vision tasks. This work introduces a novel pathway for large vision foundation models, underscoring the promising potential of diffusion transformers. The code and models are available.
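The inference flow described in the abstract (a task-specific context set plus a query, with the output produced progressively in latent space) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the names `build_condition`, `toy_denoiser`, and `sample` are hypothetical, latents are plain lists of floats rather than encoder outputs, and the "denoiser" is a hand-written stand-in, not the paper's joint diffusion transformer or its actual sampling schedule.

```python
import random


def build_condition(context_pairs, query):
    """Flatten (input, target) context latents, then append the query latent.

    Each entry is tagged with its role so the stand-in model below can tell
    context inputs, context targets, and the query apart.
    """
    tokens = []
    for inp, tgt in context_pairs:
        tokens.append(("context_input", inp))
        tokens.append(("context_target", tgt))
    tokens.append(("query", query))
    return tokens


def toy_denoiser(condition, noisy_target, step):
    """Hypothetical stand-in for the learned model (NOT the paper's network).

    It infers the "task" as the mean offset between context targets and
    context inputs, and predicts query + offset as the clean target latent.
    A real model would also use noisy_target and step; this toy ignores them.
    """
    inputs = [v for role, v in condition if role == "context_input"]
    targets = [v for role, v in condition if role == "context_target"]
    query = next(v for role, v in condition if role == "query")
    if not inputs:
        raise ValueError("at least one context pair is required")
    dim = len(query)
    offset = [
        sum(t[i] - x[i] for x, t in zip(inputs, targets)) / len(inputs)
        for i in range(dim)
    ]
    return [query[i] + offset[i] for i in range(dim)]


def sample(condition, dim, steps=10, seed=0):
    """Start from Gaussian noise and refine the target latent step by step."""
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for k in range(steps):
        pred = toy_denoiser(condition, z, k)
        # Close a growing fraction of the remaining gap to the prediction;
        # on the final step the fraction is 1.
        alpha = 1.0 / (steps - k)
        z = [zi + alpha * (pi - zi) for zi, pi in zip(z, pred)]
    return z


# Toy "task": every context target equals its input plus one.
pairs = [([0.0, 0.0], [1.0, 1.0]), ([2.0, 3.0], [3.0, 4.0])]
condition = build_condition(pairs, query=[5.0, 5.0])
print(sample(condition, dim=2))
```

In this toy setup the result should land at approximately `[6.0, 6.0]`, because the context pairs define an "add one" task that is applied to the query. Swapping in a different context set changes the output without changing any code, which is the in-context idea the abstract describes; in the real system the latents would come from the spatial-temporal variational autoencoder and the prediction from the trained transformer.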

