CL AI May 6

SLAM: Structural Linguistic Activation Marking for Language Models

arXiv:2605.05443 79.9 h-index: 3

AI Analysis

For LLM developers needing detectable watermarks without text quality degradation, SLAM offers a novel approach that avoids token-distribution bias, though it is vulnerable to syntactic paraphrase.

SLAM is a watermarking scheme for LLMs that embeds the mark in structural linguistic geometry via sparse autoencoders, achieving 100% detection accuracy on Gemma-2 models with a quality loss of only 1-2 reward points (vs. 7.5-11.5 for KGW, EWD, and Unigram).

LLM watermarks must be detectable without compromising text quality, yet most existing schemes bias the next-token distribution and pay for detection with measurable quality loss. We present SLAM (Structural Linguistic Activation Marking), a novel white-box watermarking scheme that sidesteps this cost by writing the mark into structural geometry rather than token frequencies: sparse autoencoders identify residual-stream directions encoding linguistic structure (e.g., voice, tense, clause order), and we causally steer those directions at generation time, leaving lexical sampling and semantics unconstrained. On Gemma-2 2B and 9B, SLAM achieves 100% detection accuracy with a quality cost of only 1-2 reward points - compared to 7.5-11.5 for KGW, EWD, and Unigram - with naturalness and diversity preserved at near-unwatermarked levels across both models. The trade-off is a complementary robustness profile: SLAM resists word-level edits but is vulnerable to paraphrase that restructures syntax (at a quality cost), the converse of token-distribution methods.
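The steering step described in the abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the random `direction` stands in for a residual-stream direction found by a sparse autoencoder, the random `clean` matrix stands in for per-token residual activations, and the function names `steer` and `projection_score` are hypothetical. The abstract does not specify the detector, so the projection score here only shows the arithmetic of shifting activations along a direction, not the authors' detection procedure.

```python
import numpy as np

def steer(hidden, direction, strength):
    """Shift every residual vector by `strength` along the unit direction."""
    unit = direction / np.linalg.norm(direction)
    return hidden + strength * unit

def projection_score(hidden, direction):
    """Mean projection of the residual vectors onto the unit direction."""
    unit = direction / np.linalg.norm(direction)
    return float((hidden @ unit).mean())

# Toy data (assumption): random stand-ins, not real Gemma-2 activations.
rng = np.random.default_rng(0)
d_model = 64
direction = rng.normal(size=d_model)      # stand-in for an SAE-derived direction
clean = rng.normal(size=(32, d_model))    # stand-in for 32 token positions

marked = steer(clean, direction, strength=4.0)

# The steered activations sit `strength` further along the direction on average.
shift = projection_score(marked, direction) - projection_score(clean, direction)
```

Because the shift is applied only along one direction, components of the activations orthogonal to it are unchanged, which is the intuition behind leaving lexical sampling and semantics unconstrained.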
