CL | AI | LG | Feb 18, 2025

UniGuardian: A Unified Defense for Detecting Prompt Injection, Backdoor Attacks and Adversarial Attacks in Large Language Models

arXiv:2502.13141v1 | 15 citations | h-index: 18
Originality: Incremental advance
AI Analysis

The paper addresses LLM security vulnerabilities that affect both users and developers, but the advance is incremental: it builds on existing attack detection methods by unifying them.

The paper proposes UniGuardian, a unified defense mechanism for detecting prompt injection, backdoor attacks, and adversarial attacks in Large Language Models. It identifies malicious prompts and uses a single-forward strategy so that detection and text generation share one forward pass.

Large Language Models (LLMs) are vulnerable to attacks like prompt injection, backdoor attacks, and adversarial attacks, which manipulate prompts or models to generate harmful outputs. In this paper, departing from traditional deep learning attack paradigms, we explore their intrinsic relationship and collectively term them Prompt Trigger Attacks (PTA). This raises a key question: Can we determine if a prompt is benign or poisoned? To address this, we propose UniGuardian, the first unified defense mechanism designed to detect prompt injection, backdoor attacks, and adversarial attacks in LLMs. Additionally, we introduce a single-forward strategy to optimize the detection pipeline, enabling simultaneous attack detection and text generation within a single forward pass. Our experiments confirm that UniGuardian accurately and efficiently identifies malicious prompts in LLMs.

Code Implementations: 1 repo
