CV | Jul 26, 2025

SCALAR: Scale-wise Controllable Visual Autoregressive Learning

arXiv:2507.19946v3 | 18 citations | h-index: 8 | Has Code
Originality: Incremental advance
AI Analysis

This work addresses inefficiencies in controllable generation for VAR models, offering a domain-specific solution for visual generative modeling.

The paper tackles the challenge of controllable image synthesis in Visual Autoregressive (VAR) models by introducing SCALAR, a method with a Scale-wise Conditional Decoding mechanism that improves generation quality and control precision, as demonstrated through extensive experiments.

Controllable image synthesis, which enables fine-grained control over generated outputs, has emerged as a key focus in visual generative modeling. However, controllable generation remains challenging for Visual Autoregressive (VAR) models due to their hierarchical, next-scale prediction style. Existing VAR-based methods often suffer from inefficient control encoding and disruptive injection mechanisms that compromise both fidelity and efficiency. In this work, we present SCALAR, a controllable generation method based on VAR, incorporating a novel Scale-wise Conditional Decoding mechanism. SCALAR leverages a pretrained image encoder to extract semantic control signal encodings, which are projected into scale-specific representations and injected into the corresponding layers of the VAR backbone. This design provides persistent and structurally aligned guidance throughout the generation process. Building on SCALAR, we develop SCALAR-Uni, a unified extension that aligns multiple control modalities into a shared latent space, supporting flexible multi-conditional guidance in a single model. Extensive experiments show that SCALAR achieves superior generation quality and control precision across various tasks. The code is released at https://github.com/AMAP-ML/SCALAR.
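The scale-wise conditioning idea in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not say how features are resized, projected, or injected, so the average pooling, the per-scale linear projection, and the additive injection below are all assumptions, as are the names `resize_to_scale`, `make_projections`, and `inject_control`. Random arrays stand in for the pretrained encoder output and the VAR token states, and the sketch injects once per scale rather than into individual backbone layers.

```python
import numpy as np


def resize_to_scale(feat, size):
    """Average-pool a (H, W, C) feature map to (size, size, C).

    Assumes H and W are both divisible by `size`.
    """
    H, W, C = feat.shape
    fh, fw = H // size, W // size
    return feat.reshape(size, fh, size, fw, C).mean(axis=(1, 3))


def make_projections(scales, c_in, d_model, rng):
    """One hypothetical linear projection matrix per scale."""
    return {s: rng.standard_normal((c_in, d_model)) * 0.02 for s in scales}


def inject_control(hidden, control_feat, proj, size):
    """Add scale-specific control features to the tokens of one scale.

    hidden: (size * size, d_model) token states at this scale.
    Additive injection is an assumption made for this sketch.
    """
    pooled = resize_to_scale(control_feat, size)    # (size, size, C)
    cond = pooled.reshape(size * size, -1) @ proj   # (size * size, d_model)
    return hidden + cond


rng = np.random.default_rng(0)
scales = [1, 2, 4, 8]        # token-map side length at each scale (assumed)
c_in, d_model = 16, 32       # encoder channels and backbone width (assumed)

# Stand-in for the semantic features a pretrained image encoder would
# extract from the control image (edge map, depth map, and so on).
control_feat = rng.standard_normal((8, 8, c_in))
projections = make_projections(scales, c_in, d_model, rng)

outputs = {}
for s in scales:
    hidden = np.zeros((s * s, d_model))  # stand-in for VAR token states
    outputs[s] = inject_control(hidden, control_feat, projections[s], s)
```

Because every scale receives its own projection of the same control features, the guidance is present at each step of the coarse-to-fine generation and stays spatially aligned with the token map at that resolution, which is the property the abstract describes as persistent and structurally aligned.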

