CVApr 12

UniMark: Unified Adaptive Multi-bit Watermarking for Autoregressive Image Generators

Yigit Yilmaz, Elena Petrova, Mehmet Kaya, Lucia Rossi, Amir Rahman

arXiv:2604.1184368.7h-index: 7

Predicted impact top 51% in CV · last 90 daysOriginality Highly original

AI Analysis

This work addresses the need for multi-bit, secure, and architecture-agnostic watermarking in autoregressive image generation, which is critical for ownership protection and content tracing.

UniMark introduces a training-free, unified watermarking framework for autoregressive image generators that embeds multi-bit messages via adaptive semantic grouping and block-wise encoding, achieving state-of-the-art image quality (FID) and watermark detection accuracy while maintaining robustness against various attacks.

Invisible watermarking for autoregressive (AR) image generation has recently gained attention as a means of protecting image ownership and tracing AI-generated content. However, existing approaches suffer from three key limitations: (1) they embed only zero-bit watermarks for binary verification, lacking the ability to convey multi-bit messages; (2) they rely on static codebook partitioning strategies that are vulnerable to security attacks once the partition is exposed; and (3) they are designed for specific AR architectures, failing to generalize across diverse AR paradigms. We propose \method{}, a training-free, unified watermarking framework for autoregressive image generators that addresses all three limitations. \method{} introduces three core components: \textbf{Adaptive Semantic Grouping (ASG)}, which dynamically partitions codebook entries based on semantic similarity and a secret key, ensuring both image quality preservation and security; \textbf{Block-wise Multi-bit Encoding (BME)}, which divides the token sequence into blocks and encodes different bits across blocks with error-correcting codes for reliable message transmission; and \textbf{a Unified Token-Replacement Interface (UTRI)} that abstracts the watermark embedding process to support both next-token prediction (e.g., LlamaGen) and next-scale prediction (e.g., VAR) paradigms. We provide theoretical analysis on detection error rates and embedding capacity. Extensive experiments on three AR models demonstrate that \method{} achieves state-of-the-art performance in image quality (FID), watermark detection accuracy, and multi-bit message extraction, while maintaining robustness against cropping, JPEG compression, Gaussian noise, blur, color jitter, and random erasing attacks.

View on arXiv PDF

Similar