Exact Certification of Neural Networks and Partition Aggregation Ensembles against Label Poisoning
This work addresses the need for tighter certified robustness against label-flipping attacks in neural networks, offering a more efficient and effective certification method for partition-aggregation ensembles.
The paper introduces EnsembleCert, the first certification framework for partition-aggregation ensembles that uses white-box knowledge of base classifiers to provide tighter robustness guarantees against label-flipping attacks. On CIFAR-10, it certifies up to 26.5% more label flips in median over the test set compared to existing black-box approaches, while requiring 100 times fewer partitions.
Label-flipping attacks, which corrupt training labels to induce misclassifications at inference, remain a major threat to supervised learning models. This drives the need for robustness certificates that provide formal guarantees about a model's behavior under adversarially corrupted labels. Existing certification frameworks rely on ensemble techniques such as smoothing or partition-aggregation, but treat the corresponding base classifiers as black boxes, yielding overly conservative guarantees. We introduce EnsembleCert, the first certification framework for partition-aggregation ensembles that utilizes white-box knowledge of the base classifiers. Concretely, EnsembleCert yields tighter guarantees than black-box approaches by aggregating per-partition white-box certificates to compute ensemble-level guarantees in polynomial time. To extract white-box knowledge from the base classifiers efficiently, we develop ScaLabelCert, a method that leverages the equivalence between sufficiently wide neural networks and kernel methods using the neural tangent kernel. ScaLabelCert yields the first exact, polynomial-time computable certificate for neural networks against label-flipping attacks. EnsembleCert either matches or significantly outperforms the existing partition-based black-box certificates. For example, on CIFAR-10, our method can certify up to 26.5% more label flips in median over the test set compared to the existing black-box approach while requiring 100 times fewer partitions, thereby challenging the prevailing notion that heavy partitioning is a necessity for strong certified robustness.
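To illustrate why aggregating per-partition certificates can beat a black-box vote count, the following is a minimal sketch under simplifying assumptions. It is not the EnsembleCert algorithm, which this abstract does not specify. It assumes binary classification, majority voting with ties broken toward class 0, that each label flip affects exactly one partition, and that each base classifier comes with a certified radius (the number of label flips within its partition that provably cannot change its prediction). All function names are hypothetical.

```python
from typing import Sequence


def ensemble_prediction(preds: Sequence[int]) -> int:
    """Majority vote over binary base predictions; ties go to class 0 (assumption)."""
    ones = sum(preds)
    zeros = len(preds) - ones
    return 1 if ones > zeros else 0


def partitions_to_flip(preds: Sequence[int]) -> int:
    """Minimum number of winning-class partitions whose vote must change
    to overturn the ensemble prediction."""
    c = ensemble_prediction(preds)
    n_c = sum(1 for p in preds if p == c)
    gap = n_c - (len(preds) - n_c)
    if c == 1:
        # The other class (0) wins ties, so reaching a tie is enough.
        return (gap + 1) // 2
    # The other class (1) loses ties, so it needs a strict majority.
    return gap // 2 + 1


def black_box_certificate(preds: Sequence[int]) -> int:
    """Black-box view: a single label flip may change any one base classifier."""
    return partitions_to_flip(preds) - 1


def white_box_certificate(preds: Sequence[int], radii: Sequence[int]) -> int:
    """White-box view: changing partition i costs at least radii[i] + 1 flips,
    so the adversary pays for the k cheapest winning-class partitions."""
    c = ensemble_prediction(preds)
    costs = sorted(r + 1 for p, r in zip(preds, radii) if p == c)
    k = partitions_to_flip(preds)
    return sum(costs[:k]) - 1


preds = [1, 1, 1, 1, 1, 0, 0]   # hypothetical base predictions
radii = [3, 0, 5, 2, 4, 1, 1]   # hypothetical per-partition certified radii
bb = black_box_certificate(preds)
wb = white_box_certificate(preds, radii)
```

In this toy setting, setting every radius to 0 makes the white-box certificate collapse to the black-box one, which matches the intuition that white-box knowledge can only tighten the guarantee. The multi-class case is more involved, since a changed base classifier may vote for any other class.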