Theory of Minimal Weight Perturbations in Deep Networks and its Applications for Low-Rank Activated Backdoor Attacks
This work provides theoretical guarantees of robustness against backdoor attacks in deep learning, with implications for the security of deployed models, though it is incremental in that it builds on existing perturbation theory.
The paper derives exact formulas for the minimal weight perturbations needed to change the output of a deep network and shows that they are of the same order as multi-layer Lipschitz bounds. It then applies these results to backdoor attacks, proving compression thresholds below which such attacks cannot succeed and demonstrating that low-rank compression can activate latent backdoors without loss of accuracy.
The minimal-norm weight perturbations of DNNs required to achieve a specified change in output are derived, and the factors determining their size are discussed. These single-layer exact formulae are contrasted with more generic multi-layer Lipschitz-constant-based robustness guarantees; both are observed to be of the same order, which indicates similar efficacy in their guarantees. These results are applied to precision-modification-activated backdoor attacks, establishing provable compression thresholds below which such attacks cannot succeed, and it is shown empirically that low-rank compression can reliably activate latent backdoors while preserving full-precision accuracy. These expressions reveal how back-propagated margins govern layer-wise sensitivity and provide certifiable guarantees on the smallest parameter updates consistent with a desired output shift.
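The idea of a minimal weight perturbation and of a compression threshold can be illustrated in the simplest possible setting, a single linear layer y = Wx. This is a sketch under that assumption and not the paper's general formula. For such a layer, the least-Frobenius-norm update that shifts the output by delta is the rank-one matrix delta x^T / ||x||^2, with norm ||delta|| / ||x||. The function names `minimal_perturbation` and `truncate_rank`, the layer shape, and the chosen `delta` are hypothetical and serve only as an illustration.

```python
import numpy as np

def minimal_perturbation(x, delta):
    """Least-Frobenius-norm dW with dW @ x == delta, for a linear layer y = W @ x."""
    return np.outer(delta, x) / np.dot(x, x)

def truncate_rank(W, r):
    """Best rank-r approximation of W in Frobenius norm, via truncated SVD."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r, :]

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 6))          # hypothetical layer weights
x = rng.standard_normal(6)               # hypothetical layer input
delta = np.array([0.5, -0.5, 0.0, 0.0])  # hypothetical desired output shift

dW = minimal_perturbation(x, delta)
min_norm = np.linalg.norm(delta) / np.linalg.norm(x)

# The perturbed layer produces exactly the requested output shift.
assert np.allclose((W + dW) @ x, W @ x + delta)
# The Frobenius norm of the rank-one update equals ||delta|| / ||x||.
assert np.isclose(np.linalg.norm(dW), min_norm)

# Low-rank compression viewed as a weight perturbation W -> W_r.
W_r = truncate_rank(W, 3)
compression_norm = np.linalg.norm(W - W_r)  # Frobenius norm of the change
output_shift = (W_r - W) @ x

# No perturbation can shift the output by more than its norm times ||x||,
# so the compression error bounds the shift it can cause on this input.
assert compression_norm >= np.linalg.norm(output_shift) / np.linalg.norm(x) - 1e-12
```

In this toy setting, the final inequality is the threshold argument in miniature: if the compression error is smaller than ||delta|| / ||x|| for every shift delta that would change the decision on x, then compression alone cannot change that decision. The deep-network case additionally involves the back-propagated margins mentioned above, which this sketch does not model.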