Collaborative Threshold Watermarking
This addresses the need for secure and scalable attribution mechanisms in federated learning, which is an incremental improvement over existing watermarking methods.
The paper tackles the problem of proving model provenance in federated learning by introducing a collaborative threshold watermarking scheme that allows only coalitions of at least t clients to verify the watermark, and it demonstrates effectiveness with minimal accuracy loss and robustness against attacks at scale (K=128).
In federated learning (FL), $K$ clients jointly train a model without sharing raw data. Because each participant invests data and compute, clients need mechanisms to later prove the provenance of a jointly trained model. Model watermarking embeds a hidden signal in the weights, but naive approaches either do not scale with many clients as per-client watermarks dilute as $K$ grows, or give any individual client the ability to verify and potentially remove the watermark. We introduce $(t,K)$-threshold watermarking: clients collaboratively embed a shared watermark during training, while only coalitions of at least $t$ clients can reconstruct the watermark key and verify a suspect model. We secret-share the watermark key $τ$ so that coalitions of fewer than $t$ clients cannot reconstruct it, and verification can be performed without revealing $τ$ in the clear. We instantiate our protocol in the white-box setting and evaluate on image classification. Our watermark remains detectable at scale ($K=128$) with minimal accuracy loss and stays above the detection threshold ($z\ge 4$) under attacks including adaptive fine-tuning using up to 20% of the training data.