CL AI LG | Feb 18, 2025

UniGuardian: A Unified Defense for Detecting Prompt Injection, Backdoor Attacks and Adversarial Attacks in Large Language Models

Huawei Lin, Yingjie Lao, Tong Geng, Tan Yu, Weijie Zhao

arXiv:2502.13141v114.715 citations | h-index: 18 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses security vulnerabilities in LLMs that affect both users and developers, but it is an incremental advance: it builds on existing attack detection methods by unifying them.

The paper tackles the detection of prompt injection, backdoor attacks, and adversarial attacks in Large Language Models. It proposes UniGuardian, a unified defense mechanism that identifies malicious prompts, together with a single-forward strategy that performs attack detection and text generation in one forward pass.

Large Language Models (LLMs) are vulnerable to attacks like prompt injection, backdoor attacks, and adversarial attacks, which manipulate prompts or models to generate harmful outputs. In this paper, departing from traditional deep learning attack paradigms, we explore their intrinsic relationship and collectively term them Prompt Trigger Attacks (PTA). This raises a key question: Can we determine if a prompt is benign or poisoned? To address this, we propose UniGuardian, the first unified defense mechanism designed to detect prompt injection, backdoor attacks, and adversarial attacks in LLMs. Additionally, we introduce a single-forward strategy to optimize the detection pipeline, enabling simultaneous attack detection and text generation within a single forward pass. Our experiments confirm that UniGuardian accurately and efficiently identifies malicious prompts in LLMs.

