Prompting the Unseen: Detecting Hidden Backdoors in Black-Box Models
This work addresses security vulnerabilities in black-box AI models and is relevant to users concerned with adversarial attacks, though it appears incremental because it builds on existing visual prompting techniques.
The paper tackles the detection of hidden backdoors in black-box models by introducing BProm, a method that leverages visual prompting and the class subspace inconsistency between clean and poisoned datasets. Extensive experiments show that BProm is effective at identifying backdoors.
Visual prompting (VP) is a new technique that adapts a well-trained, frozen model from a source-domain task to a target-domain task. This study examines the benefits of VP for black-box, model-level backdoor detection. The visual prompt in VP maps class subspaces between the source and target domains. We identify a misalignment, termed class subspace inconsistency, between clean and poisoned datasets. Based on this, we introduce \textsc{BProm}, a black-box, model-level detection method that identifies backdoors, if present, in suspicious models. \textsc{BProm} leverages the low classification accuracy of prompted models when backdoors are present. Extensive experiments confirm \textsc{BProm}'s effectiveness.
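The idea that a prompted model's accuracy can serve as a backdoor signal can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the border-mask style of prompt, the fixed label mapping, and the accuracy threshold are all assumptions introduced here for illustration. The abstract states only that BProm relies on the low classification accuracy of prompted models when a backdoor is present.

```python
import numpy as np

def apply_visual_prompt(images, prompt, mask):
    """Overlay a prompt on the pixels selected by mask (assumed prompt style)."""
    return images * (1.0 - mask) + prompt * mask

def prompted_accuracy(query_fn, images, labels, prompt, mask, label_map):
    """Target-task accuracy of a frozen black-box model after prompting.

    query_fn: black-box access only; maps a batch of images to source-class scores.
    label_map: array where label_map[s] is the target class assigned to source class s.
    """
    scores = query_fn(apply_visual_prompt(images, prompt, mask))
    predicted_target = label_map[np.argmax(scores, axis=1)]
    return float(np.mean(predicted_target == labels))

def flag_suspicious(accuracy, threshold=0.5):
    """Hypothetical decision rule: low prompted accuracy suggests a backdoor."""
    return accuracy < threshold
```

In this sketch the suspicious model is only ever queried through `query_fn`, which matches the black-box setting. The prompt would normally be optimized on the target dataset before accuracy is measured, and the threshold value here is a placeholder rather than a figure from the paper.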