LG · AI · CR · Mar 13, 2024

Learning to Watermark LLM-generated Text via Reinforcement Learning

arXiv:2403.10553v1 · 28 citations · h-index: 9 · Has Code
Originality: Highly original
AI Analysis

This work addresses the need for reliable detection of AI-generated content to prevent misuse. Its model-level approach allows watermarked models to be open-sourced and adds little overhead when combined with alignment.

The paper tackles the problem of watermarking LLM-generated text to track misuse. It proposes a model-level watermark embedded into the LLM weights, which the authors report to be more accurate, more robust, and more adaptable to new attacks than prior token-level methods.

We study how to watermark LLM outputs, i.e. how to embed algorithmically detectable signals into LLM-generated text to track misuse. Unlike the current mainstream methods, which work with a fixed LLM, we expand the watermark design space by including the LLM tuning stage in the watermark pipeline. While prior works focus on token-level watermarks that embed signals into the output, we design a model-level watermark that embeds signals into the LLM weights; these signals can be detected by a paired detector. We propose a co-training framework based on reinforcement learning that iteratively (1) trains a detector to detect the generated watermarked text and (2) tunes the LLM to generate text that is easily detectable by the detector while keeping its normal utility. We empirically show that our watermarks are more accurate, robust, and adaptable (to new attacks). Our approach also allows watermarked models to be open-sourced. In addition, if used together with alignment, the extra overhead introduced is low: it only requires training an extra reward model (i.e. our detector). We hope our work can bring more effort into studying a broader watermark design that is not limited to working with a fixed LLM. We open-source the code: https://github.com/xiaojunxu/learning-to-watermark-llm
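
The alternating loop in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the "LLM" here is assumed to be a single categorical distribution over 8 tokens, the detector is assumed to be a logistic regression on token frequencies, and plain REINFORCE with a KL penalty stands in for whatever RL algorithm the paper uses. All names (`co_train`, `detect`, `mean_score`) and all hyperparameters are hypothetical.

```python
import math
import random

VOCAB = 8     # toy vocabulary size (assumption)
SEQ_LEN = 20  # tokens per sampled "text" (assumption)

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_text(probs, rng):
    return rng.choices(range(VOCAB), weights=probs, k=SEQ_LEN)

def features(text):
    # Token-frequency histogram used as the detector input.
    return [text.count(t) / SEQ_LEN for t in range(VOCAB)]

def detect(w, b, text):
    # Logistic-regression detector: score in (0, 1), higher = "watermarked".
    z = b + sum(wi * xi for wi, xi in zip(w, features(text)))
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))

def co_train(rounds=300, batch=16, lr_det=1.0, lr_pol=1.0, kl_coef=0.1, seed=0):
    rng = random.Random(seed)
    ref_probs = [1.0 / VOCAB] * VOCAB  # frozen original model (uniform toy stand-in)
    logits = [0.0] * VOCAB             # tunable copy, starts equal to the original
    w = [rng.gauss(0.0, 1.0) for _ in range(VOCAB)]  # random detector init
    b = 0.0
    for _ in range(rounds):
        probs = softmax(logits)
        marked = [sample_text(probs, rng) for _ in range(batch)]
        plain = [sample_text(ref_probs, rng) for _ in range(batch)]
        # (1) Detector step: SGD on the logistic loss, marked = 1, plain = 0.
        data = [(t, 1.0) for t in marked] + [(t, 0.0) for t in plain]
        for text, label in data:
            err = label - detect(w, b, text)
            x = features(text)
            for i in range(VOCAB):
                w[i] += lr_det * err * x[i] / len(data)
            b += lr_det * err / len(data)
        # (2) Policy step: REINFORCE with the detector score as reward.
        rewards = [detect(w, b, t) for t in marked]
        baseline = sum(rewards) / batch
        grad = [0.0] * VOCAB
        for text, r in zip(marked, rewards):
            x = features(text)
            for i in range(VOCAB):
                # d/d logit_i of log p(text) = count_i - SEQ_LEN * p_i
                grad[i] += (r - baseline) * SEQ_LEN * (x[i] - probs[i]) / batch
        # KL(probs || ref_probs) penalty keeps the tuned model near the original,
        # a crude proxy for "keeping its normal utility".
        kl = sum(p * math.log(p / q) for p, q in zip(probs, ref_probs))
        for i in range(VOCAB):
            kl_grad = probs[i] * (math.log(probs[i] / ref_probs[i]) - kl)
            logits[i] += lr_pol * (grad[i] - kl_coef * kl_grad)
    return softmax(logits), ref_probs, w, b

def mean_score(w, b, probs, rng, n=200):
    # Average detector score over n texts sampled from the given distribution.
    return sum(detect(w, b, sample_text(probs, rng)) for _ in range(n)) / n

if __name__ == "__main__":
    tuned, original, w, b = co_train()
    rng = random.Random(1)
    print("tuned model score:   ", mean_score(w, b, tuned, rng))
    print("original model score:", mean_score(w, b, original, rng))
```

The two steps are cooperative rather than adversarial: the detector is pushed to separate tuned-model text from original-model text, and the tuned model is pushed toward whatever the detector currently scores highly. In this toy setting the "watermark" ends up as a shift in token frequencies, which is only a stand-in for the signal a real LLM and neural detector would learn.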

Code Implementations: 1 repo