CV · Aug 5, 2025

Skywork UniPic: Unified Autoregressive Modeling for Visual Understanding and Generation

arXiv:2508.03320v1 · 16 citations · h-index: 14 · Has Code
Originality: Highly original
AI Analysis

This work addresses the challenge of making high-fidelity multimodal integration practical to deploy. Its compact, efficient model is presented as a new paradigm rather than an incremental improvement.

The paper unifies image understanding, text-to-image generation, and image editing in a single 1.5 billion-parameter autoregressive model. The model reaches state-of-the-art performance on commodity hardware: it generates 1024 x 1024 images with under 15 GB of GPU memory and sets a DPG-Bench record of 85.5.
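The memory claim can be put in context with a quick back-of-the-envelope calculation. The sketch below only estimates the space taken by the weights themselves; the storage precisions are assumptions (the summary does not state which one is used), and `weight_memory_gb` is a hypothetical helper, not part of any released code.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """Memory needed to hold the weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


NUM_PARAMS = 1.5e9   # parameter count quoted in the summary
BUDGET_GB = 15.0     # reported upper bound for 1024 x 1024 generation

# Assumed storage precisions; the source does not say which is used.
PRECISIONS = {"fp32": 4, "fp16/bf16": 2}

for name, nbytes in PRECISIONS.items():
    weights = weight_memory_gb(NUM_PARAMS, nbytes)
    headroom = BUDGET_GB - weights
    print(f"{name}: weights {weights:.1f} GB, headroom {headroom:.1f} GB")
```

Under these assumptions the weights take roughly 3 to 6 GB, which would leave the rest of the 15 GB budget for activations, caches, and anything not covered by the 1.5 billion figure. How the real memory is split is not stated in the source.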

We introduce Skywork UniPic, a 1.5 billion-parameter autoregressive model that unifies image understanding, text-to-image generation, and image editing within a single architecture, eliminating the need for task-specific adapters or inter-module connectors, and demonstrate that compact multimodal systems can achieve state-of-the-art performance on commodity hardware. Skywork UniPic achieves a GenEval score of 0.86, surpassing most existing unified models; sets a new DPG-Bench complex-generation record of 85.5; attains 5.83 on GEditBench-EN and 3.49 on ImgEdit-Bench for image editing; and generates 1024 x 1024 images with under 15 GB of GPU memory (e.g., RTX 4090). The approach rests on three components: (1) a decoupled encoding strategy that leverages a masked autoregressive encoder for synthesis and a SigLIP2 encoder for understanding, both feeding a shared autoregressive decoder; (2) a progressive, resolution-aware training schedule scaling from 256 x 256 to 1024 x 1024 while dynamically unfreezing parameters to balance capacity and stability; and (3) meticulously curated, 100 million-scale datasets augmented with task-specific reward models to refine generation and editing objectives. By demonstrating that high-fidelity multimodal integration need not incur prohibitive resource demands, Skywork UniPic establishes a practical paradigm for deployable, high-fidelity multimodal AI. Code and weights are publicly available at https://huggingface.co/Skywork/Skywork-UniPic-1.5B.
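To make the progressive, resolution-aware schedule more concrete, here is a minimal sketch of how such a schedule might be described in code. Only the 256 and 1024 endpoints come from the abstract. The intermediate 512 stage, the module group names, and the order in which groups are unfrozen are all hypothetical, chosen purely for illustration, and do not describe the authors' actual training recipe.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Stage:
    resolution: int               # square image side length, in pixels
    trainable: Tuple[str, ...]    # module groups unfrozen during this stage


# Hypothetical module groups; the names are illustrative, not from the paper.
ALL_GROUPS = (
    "understanding_encoder",
    "generation_encoder",
    "decoder_body",
    "decoder_head",
)

# Hypothetical schedule: resolution grows while more groups become trainable.
SCHEDULE = (
    Stage(256, ("decoder_head",)),
    Stage(512, ("decoder_head", "decoder_body")),
    Stage(1024, ("decoder_head", "decoder_body", "generation_encoder")),
)


def frozen_groups(stage: Stage, all_groups: Tuple[str, ...]) -> List[str]:
    """Return the module groups that stay frozen during the given stage."""
    return [g for g in all_groups if g not in stage.trainable]


for stage in SCHEDULE:
    print(stage.resolution, "frozen:", frozen_groups(stage, ALL_GROUPS))
```

The idea the sketch tries to capture is that early, low-resolution stages update few parameters for stability, and later stages unlock more capacity as the resolution rises. In a real training loop, each stage would translate into toggling gradient updates for the corresponding parameters.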

