CV · Sep 2, 2025

Discrete Noise Inversion for Next-scale Autoregressive Text-based Image Editing

arXiv:2509.01984v2 · 5 citations · h-index: 9
Originality: Incremental advance
AI Analysis

This work addresses the need for practical image-editing tools in real-world applications. The contribution is incremental, however, because it adapts the existing idea of noise-inversion editing to a specific model class (visual autoregressive models).

The paper tackles the problem of text-guided image editing without extra training by introducing VARIN, a noise inversion technique for visual autoregressive models, which achieves precise edits while preserving background and structural details.

Visual autoregressive models (VAR) have recently emerged as a promising class of generative models, achieving performance comparable to diffusion models in text-to-image generation tasks. While conditional generation has been widely explored, the ability to perform prompt-guided image editing without additional training is equally critical, as it supports numerous practical real-world applications. This paper investigates the text-to-image editing capabilities of VAR by introducing Visual AutoRegressive Inverse Noise (VARIN), the first noise inversion-based editing technique designed explicitly for VAR models. VARIN leverages a novel pseudo-inverse function for argmax sampling, named Location-aware Argmax Inversion (LAI), to generate inverse Gumbel noises. These inverse noises enable precise reconstruction of the source image and facilitate targeted, controllable edits aligned with textual prompts. Extensive experiments demonstrate that VARIN effectively modifies source images according to specified prompts while significantly preserving the original background and structural details, thus validating its efficacy as a practical editing approach.
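The inversion idea can be illustrated at a single token position. When a token is produced by argmax over logits plus Gumbel noise (the Gumbel-max trick), inverting that step means finding noise that makes argmax return the token observed in the source image. The sketch below uses the standard truncated-Gumbel construction: sample the maximum perturbed logit, assign it to the observed token, and draw every other token's perturbed logit from a Gumbel distribution truncated below that maximum. This is an assumption chosen for illustration. It is not the paper's Location-aware Argmax Inversion (LAI), whose details are not given here. The helper names (`sample_gumbel`, `invert_argmax`, `gumbel_max_sample`) and the example logits are hypothetical.

```python
import math
import random
from typing import List


def sample_gumbel(rng: random.Random, loc: float = 0.0) -> float:
    """Draw one Gumbel(loc, 1) sample via loc - log(-log(u)), u ~ Uniform(0, 1)."""
    u = rng.random()
    while u == 0.0:  # random() may return 0.0, and log(0) would fail
        u = rng.random()
    return loc - math.log(-math.log(u))


def logsumexp(xs: List[float]) -> float:
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def argmax(xs: List[float]) -> int:
    return max(range(len(xs)), key=lambda i: xs[i])


def gumbel_max_sample(logits: List[float], noise: List[float]) -> int:
    """Gumbel-max trick: argmax(logits + g), with g ~ Gumbel(0, 1), samples softmax(logits)."""
    return argmax([l + g for l, g in zip(logits, noise)])


def invert_argmax(logits: List[float], target: int, rng: random.Random) -> List[float]:
    """Sample noise g such that argmax(logits + g) == target.

    Hypothetical stand-in for an argmax pseudo-inverse (NOT the paper's LAI):
    1. the largest perturbed logit is distributed as Gumbel(logsumexp(logits));
    2. the target token is given that largest value;
    3. each other token gets a Gumbel(logit) sample truncated to lie below it.
    """
    top = sample_gumbel(rng, logsumexp(logits))
    noise = []
    for i, l in enumerate(logits):
        if i == target:
            perturbed = top
        else:
            g = sample_gumbel(rng, l)
            # Truncation below `top`: -log(exp(-top) + exp(-g)), computed stably.
            a, b = -top, -g
            m = max(a, b)
            perturbed = -(m + math.log(math.exp(a - m) + math.exp(b - m)))
        noise.append(perturbed - l)
    return noise


if __name__ == "__main__":
    rng = random.Random(0)
    source_logits = [1.0, 2.5, 0.3, -1.0]  # hypothetical logits under the source prompt
    source_token = 2                        # hypothetical token seen in the source image
    noise = invert_argmax(source_logits, source_token, rng)
    # Reusing the inverted noise reproduces the source token (reconstruction).
    assert gumbel_max_sample(source_logits, noise) == source_token
    # Reusing the same noise under edited-prompt logits changes the token
    # only if the edit shifts the logits enough to overturn the argmax.
    edited_logits = [1.0, 2.5, 0.3, 3.0]    # hypothetical logits under the target prompt
    print(gumbel_max_sample(edited_logits, noise))
```

Fixing the noise and changing only the logits is what makes the edit selective in this toy setting. Positions where the target prompt barely changes the logits keep their source tokens, while strongly affected positions can switch.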
