50.2 CL May 13
Distribution Corrected Offline Data Distillation for Large Language Models
Yumeng Zhang, Zhengbang Yang, Yevin Nikhel Goonatilake et al.
Distilling reasoning traces from strong large language models into smaller ones is a promising route to improve intelligence in resource-constrained settings. Existing approaches face a fundamental trade-off. Offline distillation from teacher-generated traces provides high-quality, sample-efficient supervision but suffers from distributional drift: during training, the student model conditions on teacher-generated prefixes, whereas during inference the student autoregresses on self-generated prefixes, leading to compounding errors over long reasoning trajectories. On-policy and self-distillation methods better match the student's inference-time distribution, but they require costly online sampling and often produce low-quality traces early in training. We propose a principled offline reasoning distillation framework that preserves the efficiency and supervision quality of offline teacher-generated data while correcting teacher-student distributional drift. It adaptively emphasizes teacher supervision that is better aligned with the student's on-policy distribution. Evaluations on mathematical reasoning benchmarks (GSM8K, MATH, and MATH500) and on harder held-out competition-style tasks (AMC, AIME, and OlympiadBench) show that our method improves reasoning accuracy over prior offline distillation algorithms and yields more stable reasoning traces while preserving instruction-following capabilities. Our work shows that lightweight, distribution-correction-aware training can substantially strengthen offline reasoning distillation without online rollouts.
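The abstract does not spell out the weighting rule, so the following is only a minimal sketch of one generic way to "emphasize teacher supervision that is better aligned with the student", not the authors' algorithm. It assumes a hypothetical setup in which, for each teacher trace, we already have the student's log-probabilities of the teacher's tokens; the function names, the softmax weighting, and the `temperature` parameter are all illustrative assumptions.

```python
import math

def alignment_weights(mean_logprobs, temperature=1.0):
    """Softmax over per-trace mean student log-probs (assumed weighting rule).

    Traces the student already finds likely receive larger weights.
    """
    scaled = [lp / temperature for lp in mean_logprobs]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_distillation_loss(trace_token_logprobs, temperature=1.0):
    """Weighted negative log-likelihood over a batch of teacher traces.

    trace_token_logprobs: one list per teacher trace, holding the student's
    log-probability of each teacher token (hypothetical input format).
    In an actual training loop the weights would be treated as constants
    (no gradient flows through them); here everything is a plain float.
    """
    mean_lp = [sum(t) / len(t) for t in trace_token_logprobs]
    weights = alignment_weights(mean_lp, temperature)
    loss = -sum(w * lp for w, lp in zip(weights, mean_lp))
    return loss, weights

# Toy batch: the first trace is close to what the student would generate,
# the second is far from it.
batch = [[-0.1, -0.2], [-2.0, -3.0]]
loss, weights = weighted_distillation_loss(batch)
```

A higher `temperature` flattens the weights toward uniform offline distillation, while a lower one concentrates training on the traces the student already assigns high probability.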
76.0 CR May 9
Removing the Watermark Is Not Enough: Forensic Stealth in Generative-AI Watermark Removal
Yevin Nikhel Goonatilake, Giuseppe Ateniese
Watermarks for AI-generated images are meant to support downstream decisions about provenance, manipulation, and trust. In the settings that motivate watermark removal, therefore, success means more than causing the watermark test to fail. A successful remover must also preserve the utility of the image and make the output forensically indistinguishable from clean content, so that defeating the verifier restores deniability rather than merely replacing one detection signal with another. We show that current watermark removal attacks fail this stronger objective. Across six state-of-the-art removers spanning four attack families, independent forensic detectors distinguish removal-processed outputs from clean images at over 98% true-positive rate under a 1% false-positive budget. Thus, current removers often replace the watermark with a different detectable signal. Using UnMarker (IEEE S&P 2025) as a detailed case study, we show that this signal persists under common post-processing, exhibits a characteristic two-regime spectral deformation, and yields a three-way tension among removal success, image quality, and forensic stealth. These results show that existing removal benchmarks are incomplete: they reward verifier evasion and utility preservation while omitting forensic stealth. A workable watermark remover must satisfy all three conditions at once: watermark evasion, utility preservation, and forensic indistinguishability from clean content.
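The headline metric above, true-positive rate under a fixed false-positive budget, can be computed from detector scores as in the sketch below. This is a generic illustration rather than the paper's evaluation code; the function name `tpr_at_fpr` and the convention that a higher score means "more likely removal-processed" are assumptions.

```python
import math

def tpr_at_fpr(clean_scores, processed_scores, max_fpr=0.01):
    """Return (tpr, fpr, threshold) for the loosest threshold within the FPR budget.

    Assumes higher scores mean "more likely removal-processed".
    An image is flagged when its score is strictly above the threshold.
    """
    if not clean_scores or not processed_scores:
        raise ValueError("both score lists must be non-empty")
    if not 0.0 <= max_fpr < 1.0:
        raise ValueError("max_fpr must be in [0, 1)")
    ranked = sorted(clean_scores, reverse=True)
    # Number of clean images we are allowed to flag by mistake.
    allowed = math.floor(max_fpr * len(ranked) + 1e-9)
    allowed = min(allowed, len(ranked) - 1)
    # Flagging strictly above this score flags at most `allowed` clean images.
    threshold = ranked[allowed]
    tpr = sum(s > threshold for s in processed_scores) / len(processed_scores)
    fpr = sum(s > threshold for s in clean_scores) / len(clean_scores)
    return tpr, fpr, threshold

# Synthetic scores for illustration only (not data from the paper).
clean = [i / 100 for i in range(100)]
processed = [0.995] * 98 + [0.5] * 2
tpr, fpr, threshold = tpr_at_fpr(clean, processed, max_fpr=0.01)
```

Fixing the threshold on clean images first, and only then measuring detections on processed images, keeps the false-positive budget from being tuned against the attack outputs.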
CR Sep 11, 2025
The Coding Limits of Robust Watermarking for Generative Models
Danilo Francati, Yevin Nikhel Goonatilake, Shubham Pawar et al.
We prove a sharp threshold for the robustness of cryptographic watermarking for generative models. This is achieved by introducing a coding abstraction, which we call messageless secret-key codes, that formalizes sufficient and necessary requirements of robust watermarking: soundness, tamper detection, and pseudorandomness. Thus, we establish that robustness has a precise limit: for binary outputs, no scheme can survive if more than half of the encoded bits are modified, and for an alphabet of size $q$ the corresponding threshold is a $(1-1/q)$ fraction of the symbols. Complementing this impossibility, we give explicit constructions that meet the bound up to a constant slack. For every $\delta > 0$, assuming pseudorandom functions and access to a public counter, we build linear-time codes that tolerate an error fraction of up to $(1/2)(1-\delta)$ in the binary case and $(1-1/q)(1-\delta)$ in the $q$-ary case. Together with the lower bound, these yield the maximum robustness achievable under standard cryptographic assumptions. We then test experimentally whether this limit appears in practice by examining the recent image watermarking scheme of Gunn, Zhao, and Song (ICLR 2025). We show that a simple crop-and-resize operation reliably flipped about half of the latent signs and consistently prevented belief-propagation decoding from recovering the codeword, erasing the watermark while leaving the image visually intact. These results provide a complete characterization of robust watermarking, identifying the threshold at which robustness fails, constructions that achieve it, and an experimental confirmation that the threshold is already reached in practice.
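The thresholds stated above are easy to evaluate numerically. The sketch below computes them and also illustrates one standard intuition for where the $(1-1/q)$ figure comes from: a uniformly random string already differs from any fixed word in about a $(1-1/q)$ fraction of positions. The function names are illustrative assumptions, and the simulation is a sanity check on that intuition, not a reproduction of the paper's proof or experiments.

```python
import random

def impossibility_threshold(q):
    """Fraction of modified symbols beyond which no scheme survives: 1 - 1/q."""
    if q < 2:
        raise ValueError("alphabet size q must be at least 2")
    return 1.0 - 1.0 / q

def tolerated_fraction(q, delta):
    """Error fraction tolerated by the constructions: (1 - 1/q)(1 - delta)."""
    if not 0.0 < delta < 1.0:
        # delta >= 1 would give a vacuous (non-positive) bound
        raise ValueError("delta must be in (0, 1)")
    return impossibility_threshold(q) * (1.0 - delta)

def random_distance_fraction(n, q, seed=0):
    """Fraction of positions where two independent uniform q-ary strings differ."""
    rng = random.Random(seed)
    word = [rng.randrange(q) for _ in range(n)]
    other = [rng.randrange(q) for _ in range(n)]
    return sum(a != b for a, b in zip(word, other)) / n

binary_limit = impossibility_threshold(2)          # one half of the bits
binary_tolerated = tolerated_fraction(2, 0.1)      # slack of delta = 0.1
empirical_binary = random_distance_fraction(10_000, 2)
empirical_quaternary = random_distance_fraction(10_000, 4)
```

Under this view, an edit that changes about half of the bits leaves a string that is roughly as far from the codeword as unrelated random content would be, which is consistent with the crop-and-resize observation that flipping about half of the latent signs defeats decoding.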