Closing the Safety Gap: Surgical Concept Erasure in Visual Autoregressive Models
This work addresses safety concerns in text-to-image generation with VAR models and represents a domain-specific advance.
The paper tackles unsafe concept generation in visual autoregressive (VAR) models by proposing an erasure framework, VARE, and a method built on it, S-VARE, which together enable stable and precise concept removal while maintaining generation quality. The reported experiments show surgical erasure.
The rapid progress of visual autoregressive (VAR) models has brought new opportunities for text-to-image generation, but it has also heightened safety concerns. Existing concept erasure techniques, primarily designed for diffusion models, fail to generalize to VAR models because of the next-scale token prediction paradigm these models use. In this paper, we first propose VARE, a novel VAR Erasure framework that enables stable concept erasure in VAR models by leveraging auxiliary visual tokens to reduce fine-tuning intensity. Building upon this, we introduce S-VARE, a novel and effective concept erasure method designed for VAR models. S-VARE incorporates a filtered cross-entropy loss to precisely identify and minimally adjust unsafe visual tokens, along with a preservation loss to maintain semantic fidelity, addressing issues such as language drift and reduced diversity introduced by naïve fine-tuning. Extensive experiments demonstrate that our approach achieves surgical concept erasure while preserving generation quality, thereby closing the safety gap in autoregressive text-to-image generation left by earlier methods.
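To make the shape of such an objective concrete, the following is a minimal sketch of a filtered cross-entropy term combined with a preservation term. It is not the paper's formulation, since the abstract does not give the filtering rule, the form of the preservation loss, or any weighting. The sketch assumes (hypothetically) that a token position is flagged as unsafe when the token produced for the unsafe prompt differs from the token produced for a safe anchor prompt, that the preservation term is a KL divergence to a frozen reference model on unflagged positions, and that the two terms are summed with a weight `preserve_weight`. All function and parameter names are illustrative.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one row of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def flag_changed_tokens(unsafe_tokens, safe_tokens):
    """Hypothetical filter (an assumption, not the paper's rule):
    flag positions where the unsafe and safe token sequences differ."""
    return [u != s for u, s in zip(unsafe_tokens, safe_tokens)]


def filtered_cross_entropy(logits, safe_targets, unsafe_mask):
    """Mean cross-entropy toward the safe targets, computed only at
    flagged positions so that unflagged tokens are left untouched."""
    losses = [
        -log_softmax(row)[target]
        for row, target, flagged in zip(logits, safe_targets, unsafe_mask)
        if flagged
    ]
    return sum(losses) / len(losses) if losses else 0.0


def preservation_kl(ref_logits, logits, unsafe_mask):
    """Mean KL(reference || tuned) over unflagged positions (assumed form
    of the preservation term)."""
    total, count = 0.0, 0
    for ref_row, row, flagged in zip(ref_logits, logits, unsafe_mask):
        if flagged:
            continue
        ref_lp = log_softmax(ref_row)
        lp = log_softmax(row)
        total += sum(math.exp(r) * (r - q) for r, q in zip(ref_lp, lp))
        count += 1
    return total / count if count else 0.0


def erasure_loss(logits, ref_logits, safe_targets, unsafe_mask,
                 preserve_weight=1.0):
    """Weighted sum of the two terms; the weighting is an assumption."""
    return (filtered_cross_entropy(logits, safe_targets, unsafe_mask)
            + preserve_weight * preservation_kl(ref_logits, logits, unsafe_mask))


# Toy example: 3 token positions, vocabulary of 4 visual tokens.
unsafe_tokens = [3, 1, 2]   # tokens generated for the unsafe prompt
safe_tokens = [3, 0, 2]     # tokens generated for a safe anchor prompt
mask = flag_changed_tokens(unsafe_tokens, safe_tokens)
uniform = [[0.0] * 4 for _ in range(3)]
loss = erasure_loss(uniform, uniform, safe_tokens, mask)
```

In this toy case only the middle position is flagged, so the cross-entropy term acts on a single token while the other two positions contribute only through the preservation term. That is the sense in which a filtered loss keeps the adjustment small under the stated assumptions.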