Adversarial Robustness in Two-Stage Learning-to-Defer: Algorithms and Guarantees
This paper addresses security risks in multi-agent decision-making systems, though it is an incremental advance: it adds adversarial robustness to an existing L2D framework.
The paper tackles the vulnerability of two-stage Learning-to-Defer (L2D) systems to adversarial perturbations that manipulate query allocation, introducing novel attack strategies and proposing SARD, a convex learning algorithm with theoretical guarantees. Empirical results show SARD significantly improves robustness under attacks while maintaining strong clean performance.
Two-stage Learning-to-Defer (L2D) enables optimal task delegation by assigning each input to either a fixed main model or one of several offline experts, supporting reliable decision-making in complex, multi-agent environments. However, existing L2D frameworks assume clean inputs and are vulnerable to adversarial perturbations that can manipulate query allocation, causing costly misrouting or expert overload. We present the first comprehensive study of adversarial robustness in two-stage L2D systems. We introduce two novel attack strategies, untargeted and targeted, which respectively disrupt optimal allocations or force queries to specific agents. To defend against such threats, we propose SARD, a convex learning algorithm built on a family of surrogate losses that are provably Bayes-consistent and $(\mathcal{R}, \mathcal{G})$-consistent. These guarantees hold across classification, regression, and multi-task settings. Empirical results demonstrate that SARD significantly improves robustness under adversarial attacks while maintaining strong clean performance, marking a critical step toward secure and trustworthy L2D deployment.
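To make the two attack types concrete, here is a minimal sketch of how a perturbation could change query allocation. It is not the paper's attack formulation and not SARD. It assumes a hypothetical linear rejector with weight matrix `W` (one row per agent, row 0 being the main model), an allocation rule that picks the highest-scoring agent, and a single signed-gradient step bounded by `eps` in the max norm. The names `allocate`, `untargeted_attack`, and `targeted_attack` are illustrative only.

```python
import numpy as np


def allocate(W, x):
    """Route x to the agent with the highest rejector score (row 0 = main model)."""
    return int(np.argmax(W @ x))


def untargeted_attack(W, x, eps):
    """One signed step that shrinks the margin between the chosen agent and the runner-up.

    Assumption: linear scores W @ x, so the gradient of
    (runner-up score - top score) with respect to x is W[runner_up] - W[top].
    """
    order = np.argsort(W @ x)
    top, runner_up = order[-1], order[-2]
    grad = W[runner_up] - W[top]
    return x + eps * np.sign(grad)


def targeted_attack(W, x, eps, target):
    """One signed step that raises the target agent's score relative to the current choice."""
    top = allocate(W, x)
    if top == target:
        return x.copy()  # already routed to the target, nothing to do
    grad = W[target] - W[top]
    return x + eps * np.sign(grad)


# Toy setup (hypothetical values): three agents, two input features.
W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
x = np.array([1.0, 0.8])

clean_agent = allocate(W, x)
untargeted_agent = allocate(W, untargeted_attack(W, x, eps=0.2))
targeted_agent = allocate(W, targeted_attack(W, x, eps=1.0, target=2))
print(clean_agent, untargeted_agent, targeted_agent)
```

In this toy setup a small perturbation is enough to move the query away from its clean allocation, while forcing it to a specific low-scoring agent needs a much larger budget. A single step like this may fail to reach the target under a tight budget or a nonlinear rejector.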