CV · AI · Mar 6

BlackMirror: Black-Box Backdoor Detection for Text-to-Image Models via Instruction-Response Deviation

arXiv:2603.05921v1 · h-index: 12 · Has Code
Predicted impact: top 6% in CV · last 90 days
Originality: Highly original
AI Analysis

This work provides a plug-and-play, training-free solution for Model-as-a-Service (MaaS) providers to detect backdoors in text-to-image models, addressing a critical security concern for users of these services.

This paper addresses the problem of detecting backdoored text-to-image models in black-box settings, particularly for new attacks where triggered generations are visually diverse. The proposed framework, BlackMirror, identifies semantic deviations by aligning visual patterns with instructions and verifying the stability of these deviations across prompts, achieving accurate detection across various attacks.

This paper investigates the challenging task of detecting backdoored text-to-image models under black-box settings and introduces a novel detection framework, BlackMirror. Existing approaches typically rely on analyzing image-level similarity, under the assumption that backdoor-triggered generations exhibit strong consistency across samples. However, they struggle to generalize to recently emerging backdoor attacks, where backdoored generations can appear visually diverse. BlackMirror is motivated by an observation: across backdoor attacks, only partial semantic patterns within the generated image are steadily manipulated, while the rest of the content remains diverse or benign. Accordingly, BlackMirror consists of two components: MirrorMatch, which aligns visual patterns with the corresponding instructions to detect semantic deviations; and MirrorVerify, which evaluates the stability of these deviations across varied prompts to distinguish true backdoor behavior from benign responses. BlackMirror is a general, training-free framework that can be deployed as a plug-and-play module in Model-as-a-Service (MaaS) applications. Comprehensive experiments demonstrate that BlackMirror achieves accurate detection across a wide range of attacks. Code is available at https://github.com/Ferry-Li/BlackMirror.
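
The two-stage idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation and does not reflect the API of the linked repository. It assumes that each prompt and each generated image have already been reduced to a set of concept labels (for example by a captioning or vision-language model, which is not shown), and the function names `mirror_match`, `mirror_verify`, `detect`, and the `min_rate` threshold are hypothetical names chosen for illustration.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple


def mirror_match(instructed: Set[str], observed: Set[str]) -> Set[str]:
    """Toy stand-in for MirrorMatch: concepts seen in the image
    that the prompt never asked for (the semantic deviation)."""
    return observed - instructed


def mirror_verify(
    deviations_per_prompt: List[Set[str]], min_rate: float = 0.8
) -> Dict[str, float]:
    """Toy stand-in for MirrorVerify: keep only deviations that recur
    in at least `min_rate` of the prompts (an assumed threshold)."""
    if not deviations_per_prompt:
        return {}
    counts = Counter(c for devs in deviations_per_prompt for c in devs)
    total = len(deviations_per_prompt)
    return {c: n / total for c, n in counts.items() if n / total >= min_rate}


def detect(
    samples: List[Tuple[Set[str], Set[str]]], min_rate: float = 0.8
) -> Dict[str, float]:
    """Run both stages over (instructed, observed) concept-set pairs."""
    deviations = [mirror_match(instr, obs) for instr, obs in samples]
    return mirror_verify(deviations, min_rate)


# Made-up concept sets: "red hat" is injected regardless of the prompt,
# while "lamp" is a one-off benign extra.
SAMPLES = [
    ({"dog", "park"}, {"dog", "park", "red hat"}),
    ({"cat", "sofa"}, {"cat", "sofa", "red hat", "lamp"}),
    ({"car", "street"}, {"car", "red hat"}),
]

if __name__ == "__main__":
    print(detect(SAMPLES))
```

In this sketch, a deviation that appears once is treated as benign variation, while one that persists across unrelated prompts is flagged. This mirrors the abstract's observation that only part of the image content is steadily manipulated, so whole-image similarity is not needed.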

