AI · May 7

BehaviorGuard: Online Backdoor Defense for Deep Reinforcement Learning

arXiv:2605.05977 · 42.4
AI Analysis

This work addresses the need for practical backdoor defenses in DRL, offering what the authors describe as the first online defense for both single- and multi-agent settings, one that requires neither trigger knowledge nor costly fine-tuning.

BehaviorGuard proposes an online, trigger-agnostic defense against backdoor attacks in deep reinforcement learning by detecting behavioral drift in action distributions, achieving superior efficacy and efficiency over prior methods across single- and multi-agent benchmarks.

Backdoor attacks pose a serious threat to deep reinforcement learning (DRL). Current defenses typically rely on reward anomalies to reverse-engineer triggers and on model fine-tuning to remove backdoors. However, complex trigger patterns undermine their robustness, and fine-tuning is costly, which limits practical utility. We therefore shift the focus of defense to trigger-agnostic backdoor output behaviors and propose BehaviorGuard, an online behavior-based backdoor detection and mitigation framework for DRL. Specifically, we find that, regardless of the attack, backdoored policies induce consistent shifts in action distributions to ensure reliable activation, leaving detectable traces in high-quantile regions and distribution tails even in the absence of triggers. Based on this, we design a novel metric that captures behavioral drift in action distributions to identify and suppress backdoor actions at runtime. To our knowledge, this is the first online backdoor defense that counters attacks in both single- and multi-agent DRL. Evaluated across diverse benchmarks with different backdoor attacks, BehaviorGuard consistently surpasses prior methods in both efficacy and efficiency.
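
The abstract does not define the drift metric, so the following is only an illustrative sketch of the general idea of watching the upper tail of an action's probability distribution and suppressing actions whose tail has shifted. Everything here is an assumption made for illustration, not BehaviorGuard's actual method: the names `quantile`, `tail_drift`, and `guarded_action`, the 0.95 quantile, the 0.2 threshold, and the availability of a trusted "reference" window of past action probabilities.

```python
from typing import List, Sequence


def quantile(values: Sequence[float], q: float) -> float:
    """Linear-interpolation quantile of a non-empty sequence, with 0 <= q <= 1."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def tail_drift(reference: Sequence[float], recent: Sequence[float],
               q: float = 0.95) -> float:
    """Hypothetical drift score: shift of the upper quantile between two windows."""
    return quantile(recent, q) - quantile(reference, q)


def guarded_action(probs: Sequence[float],
                   reference: List[Sequence[float]],
                   recent: List[Sequence[float]],
                   threshold: float = 0.2) -> int:
    """Return the most probable action whose tail drift is within the threshold.

    reference[a] and recent[a] hold probabilities the policy previously assigned
    to action a (assumed inputs; the paper may gather statistics differently).
    """
    ranked = sorted(range(len(probs)), key=lambda a: probs[a], reverse=True)
    for a in ranked:
        if tail_drift(reference[a], recent[a]) <= threshold:
            return a
    # Every action was flagged: fall back to the greedy choice.
    return ranked[0]
```

In this sketch, an action whose recent upper-quantile probability has risen sharply relative to the reference window is skipped in favor of the next-ranked action; a real defense would need a principled choice of windows and threshold.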

