LG | AI | Apr 8, 2025

Can you Finetune your Binoculars? Embedding Text Watermarks into the Weights of Large Language Models

arXiv:2504.06446v1 | 2 citations | h-index: 6
Originality: Incremental advance
AI Analysis

This work addresses transparency and accountability concerns for users of large language models by providing a method to watermark generated text. The advance is incremental, since it builds on existing finetuning techniques.

The study tackled the problem of distinguishing AI-generated text by embedding a watermark directly into model weights through finetuning low-rank adapters, achieving a learned end-to-end strategy that balances watermark robustness, naturalness, and task performance.

The indistinguishability of AI-generated content from human text raises challenges in transparency and accountability. While several methods exist to watermark models behind APIs, embedding watermark strategies directly into model weights, so that they are later reflected in the model's outputs, is challenging. In this study we propose a strategy to finetune a pair of low-rank adapters of a model, one serving as the text-generating model and the other as the detector, so that a subtle watermark is embedded into the text generated by the first model and simultaneously optimized for detectability by the second. In this way, the watermarking strategy is fully learned end-to-end. This process poses an optimization challenge, as balancing watermark robustness, naturalness, and task performance requires trade-offs. We discuss strategies for optimizing this min-max objective and present results showing the effect of this modification on instruction finetuning.
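
To make the "pair of low-rank adapters" idea concrete, the following is a minimal sketch in plain Python, not the paper's implementation. It assumes the standard low-rank adapter form, where a frozen base weight W is modified by a trainable product of two small matrices, and shows a generator adapter and a detector adapter sharing one base weight. The helper names (`make_adapter`, `lora_weight`), the zero initialization, and the toy dimensions are all illustrative assumptions.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def make_adapter(d_out, d_in, rank, rng):
    """Hypothetical helper: create one low-rank adapter (down, up).

    `up` starts at zero so the adapter is initially a no-op (an assumed,
    commonly used initialization; the paper's choice is not stated here).
    """
    down = [[rng.gauss(0.0, 0.02) for _ in range(d_in)] for _ in range(rank)]
    up = [[0.0] * rank for _ in range(d_out)]
    return down, up


def lora_weight(base, down, up, scale=1.0):
    """Effective weight: base + scale * (up @ down). `base` is never modified."""
    delta = matmul(up, down)
    return [[w + scale * dv for w, dv in zip(w_row, d_row)]
            for w_row, d_row in zip(base, delta)]


rng = random.Random(0)
d, rank = 8, 2

# One frozen base weight shared by both roles.
base = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(d)]

# Two independent adapters: one for the text generator, one for the detector.
gen_down, gen_up = make_adapter(d, d, rank, rng)
det_down, det_up = make_adapter(d, d, rank, rng)

w_gen = lora_weight(base, gen_down, gen_up)
w_det = lora_weight(base, det_down, det_up)

# Trainable parameters per adapter versus the full weight matrix.
adapter_params = rank * d + d * rank
full_params = d * d
```

In this toy setting each adapter trains `2 * rank * d` values instead of `d * d`, which is why two roles can be attached to one base model cheaply. The min-max training loop that couples the generator and detector is deliberately left out, since the abstract does not specify its form.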

