HATS: High-Accuracy Triple-Set Watermarking for Large Language Models
This work addresses the need for reliable watermarking to curb misuse of LLM outputs, though the contribution appears incremental: it builds on existing watermarking methods and adds a new three-way partitioning scheme.
The paper tackles the misuse of LLM-generated text by proposing a triple-set watermarking technique that partitions the vocabulary into Green, Yellow, and Red sets during decoding, reporting high detection accuracy at a fixed false-positive rate while maintaining text readability.
Misuse of LLM-generated text can be curbed by watermarking techniques that embed implicit signals into the output. We propose a watermark that partitions the vocabulary at each decoding step into three sets (Green/Yellow/Red) with fixed ratios and restricts sampling to the Green and Yellow sets. At detection time, we replay the same partitions, compute Green-enrichment and Red-depletion statistics, convert them to one-sided z-scores, and aggregate their p-values via Fisher's method to decide whether a passage is watermarked. We implement generation, detection, and testing on Llama 2 7B, and evaluate true-positive rate, false-positive rate, and text quality. Results show that the triple-partition scheme achieves high detection accuracy at fixed FPR while preserving readability.
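The detection pipeline described above (replay the partitions, count Green and Red tokens, form one-sided z-scores, combine the p-values with Fisher's method) can be illustrated with a minimal sketch. This is not the authors' implementation. The set ratios (0.5/0.25/0.25), the secret key, the seeding of each partition from the previous token, and the names `partition`, `toy_generate`, and `detect` are all assumptions made for illustration, and `toy_generate` samples uniformly from Green and Yellow as a stand-in for a real language model such as Llama 2 7B.

```python
import hashlib
import math
import random

# Assumed ratios for illustration; the Red set receives the remainder.
GREEN_RATIO, YELLOW_RATIO = 0.5, 0.25
RED_RATIO = 1.0 - GREEN_RATIO - YELLOW_RATIO

def partition(prev_token, vocab_size, key="demo-key"):
    """Split token ids into (green, yellow, red) sets.

    Assumption: the split is seeded by a secret key and the previous token,
    so the detector can replay it without access to the model.
    """
    digest = hashlib.sha256(f"{key}:{prev_token}".encode()).digest()
    ids = list(range(vocab_size))
    random.Random(int.from_bytes(digest[:8], "big")).shuffle(ids)
    n_green = int(GREEN_RATIO * vocab_size)
    n_yellow = int(YELLOW_RATIO * vocab_size)
    return (set(ids[:n_green]),
            set(ids[n_green:n_green + n_yellow]),
            set(ids[n_green + n_yellow:]))

def toy_generate(length, vocab_size, key="demo-key", seed=0):
    """Stand-in for watermarked decoding: sample only from Green and Yellow."""
    rng = random.Random(seed)
    tokens = [rng.randrange(vocab_size)]
    for _ in range(length):
        green, yellow, _ = partition(tokens[-1], vocab_size, key)
        tokens.append(rng.choice(sorted(green | yellow)))
    return tokens

def upper_tail(z):
    """One-sided p-value P(Z >= z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def detect(tokens, vocab_size, key="demo-key", alpha=0.01):
    """Return (is_watermarked, combined_p_value) for a token sequence."""
    n = len(tokens) - 1
    green_hits = red_hits = 0
    for prev, tok in zip(tokens, tokens[1:]):
        green, _, red = partition(prev, vocab_size, key)
        green_hits += tok in green
        red_hits += tok in red
    # Green enrichment: more Green tokens than expected by chance.
    z_green = (green_hits - GREEN_RATIO * n) / math.sqrt(
        n * GREEN_RATIO * (1.0 - GREEN_RATIO))
    # Red depletion: fewer Red tokens than expected by chance.
    z_red = (RED_RATIO * n - red_hits) / math.sqrt(
        n * RED_RATIO * (1.0 - RED_RATIO))
    # Clamp to avoid log(0) when a p-value underflows.
    p_green = max(upper_tail(z_green), 1e-300)
    p_red = max(upper_tail(z_red), 1e-300)
    # Fisher's method: -2 * sum(ln p) follows a chi-square distribution
    # with 4 degrees of freedom under the null, whose survival function
    # has the closed form exp(-x / 2) * (1 + x / 2).
    fisher = -2.0 * (math.log(p_green) + math.log(p_red))
    p_combined = math.exp(-fisher / 2.0) * (1.0 + fisher / 2.0)
    return p_combined < alpha, p_combined

if __name__ == "__main__":
    watermarked = toy_generate(200, vocab_size=100)
    print(detect(watermarked, vocab_size=100))
```

In this sketch the threshold `alpha` plays the role of the fixed false-positive rate. Fisher's method formally assumes independent p-values, while Green and Red counts from the same passage are correlated, so the combined p-value here should be read as approximate.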