CV | Apr 14, 2025

Anchor Token Matching: Implicit Structure Locking for Training-free AR Image Editing

arXiv:2504.10434v2 | 7 citations | h-index: 10 | Has Code
Originality: Incremental advance
AI Analysis

This work addresses the challenge of efficient and flexible image editing for AR models, bridging the performance gap with diffusion models, though it appears incremental as it adapts editing strategies to a different model type.

The paper tackles image editing in autoregressive (AR) models, which suffer from spatially poor attention maps and accumulating structural errors. It introduces Implicit Structure Locking (ISLock) with Anchor Token Matching, which produces high-quality, structure-consistent edits without additional training and performs comparably to or better than conventional editing techniques.

Text-to-image generation has seen groundbreaking advancements with diffusion models, enabling high-fidelity synthesis and precise image editing through cross-attention manipulation. Recently, autoregressive (AR) models have re-emerged as powerful alternatives, leveraging next-token generation to match diffusion models. However, existing editing techniques designed for diffusion models fail to translate directly to AR models due to fundamental differences in structural control. Specifically, AR models suffer from spatial poverty of attention maps and sequential accumulation of structural errors during image editing, which disrupt object layouts and global consistency. In this work, we introduce Implicit Structure Locking (ISLock), the first training-free editing strategy for AR visual models. Rather than relying on explicit attention manipulation or fine-tuning, ISLock preserves structural blueprints by dynamically aligning self-attention patterns with reference images through the Anchor Token Matching (ATM) protocol. By implicitly enforcing structural consistency in latent space, our method ISLock enables structure-aware editing while maintaining generative autonomy. Extensive experiments demonstrate that ISLock achieves high-quality, structure-consistent edits without additional training and is superior or comparable to conventional editing techniques. Our findings pioneer the way for efficient and flexible AR-based image editing, further bridging the performance gap between diffusion and autoregressive generative models. The code will be publicly available at https://github.com/hutaiHang/ATM
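To make the idea of matching against a reference image concrete, here is a toy sketch of one plausible reading of anchor token matching. At each decoding step, the model's top-k candidate tokens for the edited prompt are compared with the token the reference image has at the same position (the "anchor"), and the candidate whose codebook embedding lies closest to the anchor is kept. This is an illustration under assumptions, not the authors' implementation: the function names, the Euclidean distance, the `top_k` parameter, and the `next_logits` callable are all hypothetical, and the abstract does not specify these details.

```python
import math


def anchor_token_match(logits, codebook, anchor_id, top_k=3):
    """Among the top_k highest-scoring tokens, return the id whose codebook
    embedding is closest (Euclidean distance) to the anchor token's embedding.

    Hypothetical helper for illustration; not taken from the paper's code.
    """
    # Candidate ids: the top_k most likely tokens under the edited prompt.
    candidates = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    anchor_vec = codebook[anchor_id]
    # Keep the candidate that stays nearest to the reference token in latent space.
    return min(candidates, key=lambda i: math.dist(codebook[i], anchor_vec))


def edit_tokens(next_logits, codebook, reference_ids, top_k=3):
    """Decode one token per reference position, conditioning each step on the
    tokens chosen so far. `next_logits(prefix)` stands in for an AR model
    (assumed interface) and returns one score per codebook entry."""
    chosen = []
    for anchor_id in reference_ids:
        logits = next_logits(chosen)
        chosen.append(anchor_token_match(logits, codebook, anchor_id, top_k))
    return chosen
```

In this sketch, a small `top_k` leaves the model little room to follow the reference, while a large `top_k` pulls the output toward reproducing the reference tokens, so the parameter trades edit strength against structure preservation.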

Code Implementations: 1 repo
