When Stronger Triggers Backfire: A High-Dimensional Theory of Backdoor Attacks

Donald Flynn, Hadas Yaron Goldhirsh, Jonathan P. Keating, Inbar Seroussi

arXiv:2605.22481


AI Analysis

Provides a theoretical foundation for understanding backdoor attacks in high dimensions, revealing counter-intuitive behaviors that challenge existing intuitions and can guide defense design.

The paper theoretically analyzes backdoor attacks in high-dimensional settings, showing that stronger training triggers can paradoxically improve clean accuracy, and that attack success peaks at a finite trigger strength and then declines. Key results include closed-form proofs for the squared loss and extensions to general convex losses. Experiments on CIFAR-10 and Gaussian surrogates match the theory, and ResNet-18 experiments show the same phenomena beyond the convex setting.

Backdoor poisoning attacks behave counter-intuitively in high dimensions: stronger training triggers can help the defender. We study regularised generalised linear models on Gaussian-mixture data in the proportional regime ($p/n \to \kappa$), varying the training trigger strength $\alpha$ against a fixed test trigger. Three phenomena emerge: (i) clean test accuracy increases with $\alpha$; (ii) attack success peaks at a finite $\alpha$ and then declines; and (iii) the most damaging trigger direction is the minimum eigenvector of the data covariance. We prove all three results in closed form for the squared loss, and extend (i) and (ii) to general convex GLM losses via a Gaussian-proxy fixed-point system. We identify a finite-sample noise floor proportional to $\kappa$ as the mechanism behind (i), invisible to classical $n \gg p$ analysis. Experiments on CIFAR-10 and Gaussian surrogates match the theory closely; ResNet-18 experiments show the same phenomena beyond the convex setting.
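The setup in the abstract can be explored numerically with a small simulation. The sketch below is only an illustration under stated assumptions, not the authors' experimental code: it uses ridge regression (squared loss with an L2 penalty) on a two-class Gaussian mixture with identity covariance, a random unit-norm trigger direction, and a poisoning rule that adds the trigger to a fixed fraction of training points and relabels them to the target class. The function name `simulate_backdoor` and all default values (`n`, `p`, `poison_frac`, `lam`, `alpha_test`) are hypothetical choices made for this example.

```python
import numpy as np

def simulate_backdoor(alpha, n=400, p=200, poison_frac=0.1,
                      alpha_test=1.0, lam=0.1, n_test=2000, seed=0):
    """Return (clean accuracy, attack success rate) for training trigger strength alpha.

    All modelling choices here are assumptions for illustration only.
    """
    rng = np.random.default_rng(seed)
    mu = np.ones(p) / np.sqrt(p)           # assumed class-mean direction, unit norm
    trigger = rng.standard_normal(p)
    trigger /= np.linalg.norm(trigger)     # assumed random unit-norm trigger direction

    def sample(m):
        # Two-class Gaussian mixture: x = y * mu + standard normal noise
        y = rng.choice([-1.0, 1.0], size=m)
        x = y[:, None] * mu + rng.standard_normal((m, p))
        return x, y

    X, y = sample(n)

    # Poison the first k training points: add the trigger, relabel to target class +1
    k = int(poison_frac * n)
    X[:k] += alpha * trigger
    y[:k] = 1.0

    # Ridge regression: minimise (1/n)||Xw - y||^2 + lam * ||w||^2 (up to constants)
    w = np.linalg.solve(X.T @ X / n + lam * np.eye(p), X.T @ y / n)

    # Clean accuracy on fresh, untriggered test data
    X_test, y_test = sample(n_test)
    clean_acc = float(np.mean(np.sign(X_test @ w) == y_test))

    # Attack success: triggered class -1 test points classified as the target +1
    X_attack = X_test[y_test == -1.0] + alpha_test * trigger
    attack_success = float(np.mean(X_attack @ w > 0))
    return clean_acc, attack_success

# Sweep the training trigger strength while the test trigger stays fixed
for a in (0.5, 1.0, 2.0, 4.0):
    acc, asr = simulate_backdoor(a)
    print(f"alpha={a:.1f}  clean_acc={acc:.3f}  attack_success={asr:.3f}")
```

Here $p/n = 0.5$ plays the role of $\kappa$. Whether the trends (i) and (ii) are visible at these sizes depends on the assumed parameters; averaging over several seeds and a finer grid of $\alpha$ would give a fairer comparison with the theory.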
