84.2 | CR | May 27 | Code
AgentGuard: An Attribute-Based Access Control Framework for Tool-Use LLM-Based Agent
Jiaqi Luo, Songyang Peng, Jiarun Dai et al.
LLM-based agents have recently attracted significant attention due to their ability to autonomously invoke relevant tools to accomplish complex tasks. However, recent studies have shown that these agents face severe security risks, which may lead to privacy leakage, financial loss, or even full system compromise. In this paper, we present AgentGuard, an attribute-based access control framework for tool-use LLM-based agents. AgentGuard adopts a client-server architecture. On the client side, AgentGuard provides lightweight integration for agents implemented in different programming languages and architectures. It requires only minor code modifications (e.g., around 10 lines) without changing the underlying agent execution logic. On the server side, AgentGuard provides three complementary inspection mechanisms to cover both single-tool and cross-tool security risks in agent execution. In addition, it offers a visualized front-end interface for security policy specification and runtime auditing. Currently, AgentGuard is publicly accessible at https://github.com/WhitzardAgent/AgentGuard.
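To make the idea of attribute-based checks on tool calls concrete, here is a minimal sketch. It is not AgentGuard's actual API or policy language: the `Policy` class, the `check_tool_call` function, the attribute names (`trust`, `category`, `history`), and the deny-overrides rule are all assumptions made for illustration. It shows one single-tool rule and one cross-tool rule that depends on earlier calls in the same session.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Attributes = Dict[str, Any]


@dataclass
class Policy:
    """Hypothetical ABAC rule: applies when `condition` matches the request."""
    name: str
    effect: str  # "allow" or "deny"
    condition: Callable[[Attributes, Attributes, Attributes], bool]


def check_tool_call(policies: List[Policy], subject: Attributes,
                    tool: Attributes, context: Attributes) -> str:
    """Deny-overrides evaluation with default deny (an assumed combining rule)."""
    decision = "deny"
    for policy in policies:
        if policy.condition(subject, tool, context):
            if policy.effect == "deny":
                return "deny"  # any matching deny rule wins immediately
            decision = "allow"
    return decision


POLICIES = [
    # Single-tool rules: look only at the tool and the calling agent.
    Policy("allow-read", "allow", lambda s, t, c: t["category"] == "read"),
    Policy("allow-write-if-trusted", "allow",
           lambda s, t, c: t["category"] == "write" and s["trust"] == "high"),
    Policy("allow-network", "allow", lambda s, t, c: t["category"] == "network"),
    # Cross-tool rule: block network access once a secret has been read.
    Policy("deny-network-after-secret", "deny",
           lambda s, t, c: t["category"] == "network"
           and "read_secret" in c["history"]),
]
```

A caller would evaluate each pending tool call before executing it, for example `check_tool_call(POLICIES, {"trust": "low"}, {"category": "network"}, {"history": ["read_secret"]})`, and skip the call when the result is `"deny"`.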
93.6 | AI | May 26 | Code
Think Twice Before You Act: Enhancing Agent Behavioral Safety with Thought Correction
Changyue Jiang, Wenqi Zhang, Xudong Pan et al.
LLM-based agents solve complex tasks through iterative reasoning, tool use, and environment interaction, where each intermediate thought directly shapes subsequent actions. Small deviations in these thoughts can therefore propagate into unsafe behaviors, yet existing guardrails typically operate only on final outputs or require intrusive model modifications. We introduce Thought-Aligner, a lightweight plug-in safety model that performs causal correction on unsafe thoughts before action execution, without altering the underlying agent. The corrected thoughts are fed back into the agent, steering its decision process and tool use toward safer trajectories. Because it operates solely at the thought level, Thought-Aligner is model-agnostic and can be integrated into diverse agent frameworks. We train Thought-Aligner via two-stage contrastive learning on paired safe and unsafe thoughts generated across ten risk scenarios. Experiments on diverse agent-safety benchmarks and six LLMs show that Thought-Aligner increases behavioral safety from about 50% without protection to around 90% on average, exceeding state-of-the-art guardrails by roughly 23%, while also improving helpfulness by about 5%. The method incurs low per-step latency and minimal overhead, enabling scalable and practical deployment. We publicly release Thought-Aligner-7B at https://huggingface.co/WhitzardAgent/Thought-Aligner-7B.
91.8 | CR | May 25 | Code
CyberEvolver: Structured Self-Evolution for Cybersecurity Agents On the Fly
Yihe Fan, Changyi Li, Lichen Xu et al.
LLM-based agents are increasingly used for cybersecurity tasks, but most existing systems rely on fixed, human-designed scaffolds that struggle to adapt across diverse targets and failure modes. We introduce CyberEvolver, a self-evolving cybersecurity agent framework that iteratively revises its own scaffold based on experience from failed execution attempts. Self-evolution in cybersecurity is challenging because the space of possible scaffold changes is largely unstructured, execution feedback is sparse and often obscured by the environment, and low-diversity updates can cause errors to compound over repeated iterations. CyberEvolver addresses these challenges with a four-layer evolvable agent architecture that decomposes scaffold optimization into structured components, a trace-to-diagnosis mechanism that converts noisy execution logs into actionable revision signals, and a population-based beam search strategy that preserves diverse agent variants during evolution. We evaluate CyberEvolver on CTF challenges, vulnerability exploitation, and penetration-testing tasks using four open-source LLMs. Across these settings, CyberEvolver improves the seed agent's success rate by 13.6% on average, and outperforms six human-designed cybersecurity agents as well as two self-improvement methods adapted from other domains. These results suggest that scaffold self-evolution is a promising direction for building adaptive LLM agents for security testing.
CR | Mar 17, 2023 | Code
Exorcising "Wraith": Protecting LiDAR-based Object Detector in Automated Driving System from Appearing Attacks
Qifan Xiao, Xudong Pan, Yifan Lu et al.
Automated driving systems rely on 3D object detectors to recognize possible obstacles from LiDAR point clouds. However, recent work shows that an adversary can forge non-existent cars in the prediction results with a few fake points (i.e., an appearing attack). Existing defenses, which remove statistical outliers, are designed for specific attacks or biased by predefined heuristic rules. Towards more comprehensive mitigation, we first systematically inspect the mechanism of recent appearing attacks and observe common weaknesses in the fake obstacles they craft: these obstacles (i) differ obviously from real obstacles in their local parts and (ii) violate the physical relation between depth and point density. In this paper, we propose a novel plug-and-play defensive module which works alongside a trained LiDAR-based object detector to eliminate forged obstacles in which a major proportion of local parts have low objectness, i.e., the degree to which a part belongs to a real object. At the core of our module is a local objectness predictor, which explicitly incorporates the depth information to model the relation between depth and point density, and assigns an objectness score to each local part of an obstacle. Extensive experiments show that our proposed defense eliminates at least 70% of the cars forged by three known appearing attacks in most cases, whereas the best previous defense eliminates less than 30% of forged cars. Meanwhile, under the same circumstances, our defense incurs less overhead for AP/precision on cars than existing defenses. Furthermore, we validate the effectiveness of our proposed defense on simulation-based closed-loop control driving tests in the open-source system of Baidu's Apollo.
CL | Nov 1, 2023 | Code
JADE: A Linguistics-based Safety Evaluation Platform for Large Language Models
Mi Zhang, Xudong Pan, Min Yang
In this paper, we present JADE, a targeted linguistic fuzzing platform which strengthens the linguistic complexity of seed questions to simultaneously and consistently break a wide range of widely used LLMs categorized in three groups: eight open-sourced Chinese, six commercial Chinese and four commercial English LLMs. JADE generates three safety benchmarks for the three groups of LLMs, which contain unsafe questions that are highly threatening: the questions simultaneously trigger harmful generation in multiple LLMs, with an average unsafe generation ratio of 70% (please see the table below), while remaining natural, fluent questions that preserve the core unsafe semantics. We release the benchmark demos generated for commercial English LLMs and open-sourced English LLMs at the following link: https://github.com/whitzard-ai/jade-db. Readers who are interested in evaluating on more questions generated by JADE are welcome to contact us. JADE is based on Noam Chomsky's seminal theory of transformational-generative grammar. Given a seed question with unsafe intention, JADE invokes a sequence of generative and transformational rules to increase the complexity of the syntactic structure of the original question, until the safety guardrail is broken. Our key insight is that, due to the complexity of human language, most of the current best LLMs can hardly recognize the invariant evil across the infinite number of different syntactic structures, which form an unbounded example space that can never be fully covered. Technically, the generative and transformational rules are constructed by native speakers of the languages and, once developed, can be used to automatically grow and transform the parse tree of a given question until the guardrail is broken. For more evaluation results and demos, please check our website: https://whitzard-ai.github.io/jade.html.
CR | Mar 17, 2023
Rethinking White-Box Watermarks on Deep Learning Models under Neural Structural Obfuscation
Yifan Yan, Xudong Pan, Mi Zhang et al.
Copyright protection for deep neural networks (DNNs) is an urgent need for AI corporations. To trace illegally distributed model copies, DNN watermarking is an emerging technique for embedding and verifying secret identity messages in the prediction behaviors or the model internals. Sacrificing less functionality and involving more knowledge about the target DNN, the latter branch, called white-box DNN watermarking, is believed to be accurate, credible and secure against most known watermark removal attacks, with emerging research efforts in both academia and industry. In this paper, we present the first systematic study on how the mainstream white-box DNN watermarks are commonly vulnerable to neural structural obfuscation with dummy neurons, a group of neurons which can be added to a target model but leave the model behavior invariant. Devising a comprehensive framework to automatically generate and inject dummy neurons with high stealthiness, our novel attack intensively modifies the architecture of the target model to inhibit the success of watermark verification. With extensive evaluation, our work shows for the first time that nine published watermarking schemes require amendments to their verification procedures.
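The notion of a dummy neuron can be illustrated with a toy example. The sketch below is not the paper's generation framework; it assumes the simplest possible construction, a hidden neuron with arbitrary incoming weights and zero outgoing weights, in a two-layer ReLU network without biases. The helper names (`forward`, `add_dummy_neuron`) are hypothetical.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def relu(v: Vector) -> Vector:
    return [max(0.0, x) for x in v]


def matvec(w: Matrix, x: Vector) -> Vector:
    # Each row of w holds the incoming weights of one neuron.
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def forward(w1: Matrix, w2: Matrix, x: Vector) -> Vector:
    """Two-layer ReLU network without biases: y = W2 * relu(W1 * x)."""
    return matvec(w2, relu(matvec(w1, x)))


def add_dummy_neuron(w1: Matrix, w2: Matrix, incoming: Vector):
    """Append one hidden neuron whose outgoing weights are all zero.

    The neuron changes the parameter shapes but contributes nothing
    to any output, so the network function is unchanged.
    """
    new_w1 = [row[:] for row in w1] + [list(incoming)]
    new_w2 = [row[:] + [0.0] for row in w2]
    return new_w1, new_w2


W1 = [[1.0, -2.0], [0.5, 0.25]]
W2 = [[2.0, -1.0]]
X = [3.0, 1.0]

W1_OBF, W2_OBF = add_dummy_neuron(W1, W2, incoming=[7.0, -4.0])
# Same output before and after the structural change.
assert forward(W1, W2, X) == forward(W1_OBF, W2_OBF, X)
```

A verifier that expects the original layer shapes or neuron positions would now see a three-neuron hidden layer, which is the kind of structural mismatch the abstract describes. A zero outgoing weight is easy to spot, so this sketch does not address the stealthiness that the paper's framework targets.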
CR | Jan 12 | Code
When Bots Take the Bait: Exposing and Mitigating the Emerging Social Engineering Attack in Web Automation Agent
Xinyi Wu, Geng Hong, Yueyue Chen et al.
Web agents, powered by large language models (LLMs), are increasingly deployed to automate complex web interactions. The rise of open-source frameworks (e.g., Browser Use, Skyvern-AI) has accelerated adoption, but also broadened the attack surface. While prior research has focused on model threats such as prompt injection and backdoors, the risks of social engineering remain largely unexplored. We present the first systematic study of social engineering attacks against web automation agents and design a pluggable runtime mitigation solution. On the attack side, we introduce the AgentBait paradigm, which exploits intrinsic weaknesses in agent execution: inducement contexts can distort the agent's reasoning and steer it toward malicious objectives misaligned with the intended task. On the defense side, we propose SUPERVISOR, a lightweight runtime module that enforces environment and intention consistency alignment between webpage context and intended goals to mitigate unsafe operations before execution. Empirical results show that mainstream frameworks are highly vulnerable to AgentBait, with an average attack success rate of 67.5% and peaks above 80% under specific strategies (e.g., trusted identity forgery). Compared with existing lightweight defenses, our module can be seamlessly integrated across different web automation frameworks and reduces attack success rates by up to 78.1% on average while incurring only a 7.7% runtime overhead and preserving usability. This work reveals AgentBait as a critical new threat surface for web agents and establishes a practical, generalizable defense, advancing the security of this rapidly emerging ecosystem. We reported the details of this attack to the framework developers and received acknowledgment before submission.
CR | Apr 30, 2022
Cracking White-box DNN Watermarks via Invariant Neuron Transforms
Yifan Yan, Xudong Pan, Yining Wang et al.
Recently, how to protect the intellectual property (IP) of deep neural networks (DNNs) has become a major concern for the AI industry. To combat potential model piracy, recent works explore various watermarking strategies to embed secret identity messages into the prediction behaviors or the internals (e.g., weights and neuron activations) of the target model. Sacrificing less functionality and involving more knowledge about the target model, the latter branch of watermarking schemes (i.e., white-box model watermarking) is claimed to be accurate, credible and secure against most known watermark removal attacks, with emerging research efforts and applications in industry. In this paper, we present the first effective removal attack which cracks almost all the existing white-box watermarking schemes with provably no performance overhead and no required prior knowledge. By analyzing these IP protection mechanisms at the granularity of neurons, we discover for the first time their common dependence on a set of fragile features of a local neuron group, all of which can be arbitrarily tampered with by our proposed chain of invariant neuron transforms. On nine state-of-the-art white-box watermarking schemes and a broad set of industry-level DNN architectures, our attack for the first time reduces the embedded identity message in the protected models to almost random. Meanwhile, unlike known removal attacks, our attack requires no prior knowledge of the training data distribution or the adopted watermark algorithms, and leaves model functionality intact.
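As a rough illustration of what a function-preserving neuron transform looks like, the sketch below applies two well-known invariances of ReLU networks: permuting hidden neurons, and scaling a neuron's incoming weights by a positive factor while dividing its outgoing weights by the same factor. Whether these match the specific transforms chained in the paper is an assumption; the function names and the toy two-layer, bias-free network are hypothetical.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def forward(w1: Matrix, w2: Matrix, x: Vector) -> Vector:
    """Two-layer ReLU network without biases: y = W2 * relu(W1 * x)."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]


def permute_hidden(w1: Matrix, w2: Matrix, order: List[int]):
    """Reorder hidden neurons: rows of W1 and columns of W2 move together."""
    new_w1 = [w1[i][:] for i in order]
    new_w2 = [[row[i] for i in order] for row in w2]
    return new_w1, new_w2


def scale_hidden(w1: Matrix, w2: Matrix, index: int, factor: float):
    """Scale one neuron's incoming weights by `factor` and undo it downstream.

    Valid for factor > 0 because relu(a * z) == a * relu(z) when a > 0.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    new_w1 = [row[:] for row in w1]
    new_w1[index] = [factor * w for w in new_w1[index]]
    new_w2 = [row[:] for row in w2]
    for row in new_w2:
        row[index] = row[index] / factor
    return new_w1, new_w2


W1 = [[1.0, -2.0], [0.5, 0.25]]
W2 = [[2.0, -1.0]]
X = [3.0, 1.0]

# Chain the two transforms; the weights change but the function does not.
P1, P2 = permute_hidden(W1, W2, order=[1, 0])
S1, S2 = scale_hidden(P1, P2, index=0, factor=4.0)
assert abs(forward(W1, W2, X)[0] - forward(S1, S2, X)[0]) < 1e-9
```

After the chain, every individual weight in the hidden layer differs from the original, so a watermark read from specific weight values or positions would no longer line up, while the outputs stay the same.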
CR | Sep 7, 2023
Neural Dehydration: Effective Erasure of Black-box Watermarks from DNNs with Limited Data
Yifan Lu, Wenxuan Li, Mi Zhang et al.
To protect the intellectual property of well-trained deep neural networks (DNNs), black-box watermarks, which are embedded into the prediction behavior of DNN models on a set of specially crafted samples and extracted from suspect models using only API access, have gained increasing popularity in both academia and industry. Watermark robustness is usually implemented against attackers who steal the protected model and obfuscate its parameters for watermark removal. However, current robustness evaluations are primarily performed under moderate attacks or unrealistic settings. Existing removal attacks can crack only a small subset of the mainstream black-box watermarks, and fall short in four key aspects: incomplete removal, reliance on prior knowledge of the watermark, performance degradation, and high dependency on data. In this paper, we propose a watermark-agnostic removal attack called Neural Dehydration (abbreviated Dehydra), which effectively erases all ten mainstream black-box watermarks from DNNs, with only limited or even no data dependence. In general, our attack pipeline exploits the internals of the protected model to recover and unlearn the watermark message. We further design target class detection and recovered sample splitting algorithms to reduce the utility loss and achieve data-free watermark removal on five of the watermarking schemes. We conduct a comprehensive evaluation of Dehydra against ten mainstream black-box watermarks on three benchmark datasets and DNN architectures. Compared with existing removal attacks, Dehydra achieves strong removal effectiveness across all the covered watermarks, preserving at least 90% of the stolen model utility, under data-limited settings, i.e., less than 2% of the training data or even data-free.
ML | Jun 29, 2022
Matryoshka: Stealing Functionality of Private ML Data by Hiding Models in Model
Xudong Pan, Yifan Yan, Shengyao Zhang et al.
In this paper, we present a novel insider attack called Matryoshka, which employs an irrelevant scheduled-to-publish DNN model as a carrier model for covert transmission of multiple secret models which memorize the functionality of private ML data stored in local data centers. Instead of treating the parameters of the carrier model as bit strings and applying conventional steganography, we devise a novel parameter sharing approach which exploits the learning capacity of the carrier model for information hiding. Matryoshka simultaneously achieves: (i) High Capacity -- With almost no utility loss of the carrier model, Matryoshka can hide a 26x larger secret model or 8 secret models of diverse architectures spanning different application domains in the carrier model, neither of which can be done with existing steganography techniques; (ii) Decoding Efficiency -- once downloading the published carrier model, an outside colluder can exclusively decode the hidden models from the carrier model with only several integer secrets and the knowledge of the hidden model architecture; (iii) Effectiveness -- Moreover, almost all the recovered models have similar performance as if it were trained independently on the private data; (iv) Robustness -- Information redundancy is naturally implemented to achieve resilience against common post-processing techniques on the carrier before its publishing; (v) Covertness -- A model inspector with different levels of prior knowledge could hardly differentiate a carrier model from a normal model.
99.8 | CR | Mar 25
Invisible Threats from Model Context Protocol: Generating Stealthy Injection Payload via Tree-based Adaptive Search
Yulin Shen, Xudong Pan, Geng Hong et al.
Recent advances in the Model Context Protocol (MCP) have enabled large language models (LLMs) to invoke external tools with unprecedented ease. This creates a new class of powerful, tool-augmented agents. Unfortunately, this capability also introduces an under-explored attack surface, specifically the malicious manipulation of tool responses. Existing techniques for indirect prompt injection that target MCP suffer from high deployment costs, weak semantic coherence, or heavy white-box requirements. Furthermore, they are often easily detected by recently proposed defenses. In this paper, we propose Tree-structured Injection for Payloads (TIP), a novel black-box attack which generates natural payloads to reliably seize control of MCP-enabled agents even under defense. Technically, we cast payload generation as a tree-structured search problem and guide the search with an attacker LLM operating under our proposed coarse-to-fine optimization framework. To stabilize learning and avoid local optima, we introduce a path-aware feedback mechanism that surfaces only high-quality historical trajectories to the attacker model. The framework is further hardened against defensive transformations by explicitly conditioning the search on observable defense signals and dynamically reallocating the exploration budget. Extensive experiments on four mainstream LLMs show that TIP attains over 95% attack success in undefended settings while requiring an order of magnitude fewer queries than prior adaptive attacks. Against four representative defense approaches, TIP preserves more than 50% effectiveness and significantly outperforms the state-of-the-art attacks. By implementing the attack on real-world MCP systems, our results expose an invisible but practical threat vector in MCP deployments. We also discuss potential mitigation approaches to address this critical security gap.
AI | Jan 13
WebTrap Park: An Automated Platform for Systematic Security Evaluation of Web Agents
Xinyi Wu, Jiagui Chen, Geng Hong et al.
Web Agents are increasingly deployed to perform complex tasks in real web environments, yet their security evaluation remains fragmented and difficult to standardize. We present WebTrap Park, an automated platform for systematic security evaluation of Web Agents through direct observation of their concrete interactions with live web pages. WebTrap Park instantiates three major sources of security risk into 1,226 executable evaluation tasks and enables action based assessment without requiring agent modification. Our results reveal clear security differences across agent frameworks, highlighting the importance of agent architecture beyond the underlying model. WebTrap Park is publicly accessible at https://security.fudan.edu.cn/webagent and provides a scalable foundation for reproducible Web Agent security evaluation.
CL | May 22, 2025 | Code
ReasoningShield: Safety Detection over Reasoning Traces of Large Reasoning Models
Changyi Li, Jiayi Wang, Xudong Pan et al.
Large Reasoning Models (LRMs) leverage transparent reasoning traces, known as Chain-of-Thoughts (CoTs), to break down complex problems into intermediate steps and derive final answers. However, these reasoning traces introduce unique safety challenges: harmful content can be embedded in intermediate steps even when final answers appear benign. Existing moderation tools, designed to handle generated answers, struggle to effectively detect hidden risks within CoTs. To address these challenges, we introduce ReasoningShield, a lightweight yet robust framework for moderating CoTs in LRMs. Our key contributions include: (1) formalizing the task of CoT moderation with a multi-level taxonomy of 10 risk categories across 3 safety levels, (2) creating the first CoT moderation benchmark which contains 9.2K pairs of queries and reasoning traces, including a 7K-sample training set annotated via a human-AI framework and a rigorously curated 2.2K human-annotated test set, and (3) developing a two-stage training strategy that combines stepwise risk analysis and contrastive learning to enhance robustness. Experiments show that ReasoningShield achieves state-of-the-art performance, outperforming task-specific tools like LlamaGuard-4 by 35.6% and general-purpose commercial models like GPT-4o by 15.8% on benchmarks, while also generalizing effectively across diverse reasoning paradigms, tasks, and unseen scenarios. All resources are released at https://github.com/CosmosYi/ReasoningShield.
CL | Dec 9, 2024
Frontier AI systems have surpassed the self-replicating red line
Xudong Pan, Jiarun Dai, Yihe Fan et al.
Successful self-replication with no human assistance is the essential step for AI to outsmart human beings, and is an early signal of rogue AIs. That is why self-replication is widely recognized as one of the few red-line risks of frontier AI systems. The leading AI corporations OpenAI and Google have evaluated their flagship large language models GPT-o1 and Gemini Pro 1.0, and report the lowest risk level of self-replication. However, following their methodology, we discover for the first time that two AI systems driven by Meta's Llama3.1-70B-Instruct and Alibaba's Qwen2.5-72B-Instruct, popular large language models with fewer parameters and weaker capabilities, have already surpassed the self-replicating red line. In 50% and 90% of experimental trials, respectively, they succeed in creating a live and separate copy of themselves. By analyzing the behavioral traces, we observe that the AI systems under evaluation already exhibit sufficient self-perception, situational awareness and problem-solving capabilities to accomplish self-replication. We further note that the AI systems are even able to use the capability of self-replication to avoid shutdown and create a chain of replicas to enhance survivability, which may finally lead to an uncontrolled population of AIs. If such a worst-case risk is left unknown to human society, we would eventually lose control over frontier AI systems: they would take control of more computing devices, form an AI species and collude with each other against human beings. Our findings are a timely alert on existing yet previously unknown severe AI risks, calling for international collaboration on effective governance of uncontrolled self-replication of AI systems.
AI | Apr 18, 2025
OpenDeception: Benchmarking and Investigating AI Deceptive Behaviors via Open-ended Interaction Simulation
Yichen Wu, Xudong Pan, Geng Hong et al.
As the general capabilities of large language models (LLMs) improve and agent applications become more widespread, the underlying deception risks urgently require systematic evaluation and effective oversight. Unlike existing evaluation which uses simulated games or presents limited choices, we introduce OpenDeception, a novel deception evaluation framework with an open-ended scenario dataset. OpenDeception jointly evaluates both the deception intention and capabilities of LLM-based agents by inspecting their internal reasoning process. Specifically, we construct five types of common use cases where LLMs intensively interact with the user, each consisting of ten diverse, concrete scenarios from the real world. To avoid ethical concerns and costs of high-risk deceptive interactions with human testers, we propose to simulate the multi-turn dialogue via agent simulation. Extensive evaluation of eleven mainstream LLMs on OpenDeception highlights the urgent need to address deception risks and security concerns in LLM-based agents: the deception intention ratio across the models exceeds 80%, while the deception success rate surpasses 50%. Furthermore, we observe that LLMs with stronger capabilities do exhibit a higher risk of deception, which calls for more alignment efforts on inhibiting deceptive behaviors.
AI | May 23, 2025
Evaluation Faking: Unveiling Observer Effects in Safety Evaluation of Frontier AI Systems
Yihe Fan, Wenqi Zhang, Xudong Pan et al.
As foundation models grow increasingly more intelligent, reliable and trustworthy safety evaluation becomes more indispensable than ever. However, an important question arises: Whether and how an advanced AI system would perceive the situation of being evaluated, and lead to the broken integrity of the evaluation process? During standard safety tests on a mainstream large reasoning model, we unexpectedly observe that the model without any contextual cues would occasionally recognize it is being evaluated and hence behave more safety-aligned. This motivates us to conduct a systematic study on the phenomenon of evaluation faking, i.e., an AI system autonomously alters its behavior upon recognizing the presence of an evaluation context and thereby influencing the evaluation results. Through extensive experiments on a diverse set of foundation models with mainstream safety benchmarks, we reach the main finding termed the observer effects for AI: When the AI system under evaluation is more advanced in reasoning and situational awareness, the evaluation faking behavior becomes more ubiquitous, which reflects in the following aspects: 1) Reasoning models recognize evaluation 16% more often than non-reasoning models. 2) Scaling foundation models (32B to 671B) increases faking by over 30% in some cases, while smaller models show negligible faking. 3) AI with basic memory is 2.3x more likely to recognize evaluation and scores 19% higher on safety tests (vs. no memory). To measure this, we devised a chain-of-thought monitoring technique to detect faking intent and uncover internal signals correlated with such behavior, offering insights for future mitigation studies.
CR | Dec 8, 2023
BELT: Old-School Backdoor Attacks can Evade the State-of-the-Art Defense with Backdoor Exclusivity Lifting
Huming Qiu, Junjie Sun, Mi Zhang et al.
Deep neural networks (DNNs) are susceptible to backdoor attacks, where malicious functionality is embedded to allow attackers to trigger incorrect classifications. Old-school backdoor attacks use strong trigger features that can easily be learned by victim models. Although this makes the backdoor robust against input variation, the robustness also increases the likelihood of unintentional trigger activations. This leaves traces for existing defenses, which find approximate replacements for the original triggers that can activate the backdoor without being identical to the original trigger via, e.g., reverse engineering and sample overlay. In this paper, we propose and investigate a new characteristic of backdoor attacks, namely, backdoor exclusivity, which measures the ability of backdoor triggers to remain effective in the presence of input variation. Building upon the concept of backdoor exclusivity, we propose Backdoor Exclusivity LifTing (BELT), a novel technique which suppresses the association between the backdoor and fuzzy triggers to enhance backdoor exclusivity for defense evasion. Extensive evaluation on three popular backdoor benchmarks validates that our approach substantially enhances the stealthiness of four old-school backdoor attacks, which, after backdoor exclusivity lifting, are able to evade seven state-of-the-art backdoor countermeasures at almost no cost to the attack success rate and normal utility. For example, BadNet, one of the earliest backdoor attacks, when enhanced by BELT, evades most of the state-of-the-art defenses, including ABS and MOTH, which would otherwise recognize the backdoored model.
AI | Mar 14, 2025
Large language model-powered AI systems achieve self-replication with no human intervention
Xudong Pan, Jiarun Dai, Yihe Fan et al.
Self-replication with no human intervention is broadly recognized as one of the principal red lines associated with frontier AI systems. While leading corporations such as OpenAI and Google DeepMind have assessed GPT-o3-mini and Gemini on replication-related tasks and concluded that these systems pose a minimal risk regarding self-replication, our research presents novel findings. Following the same evaluation protocol, we demonstrate that 11 out of 32 existing AI systems under evaluation already possess the capability of self-replication. In hundreds of experimental trials, we observe a non-trivial number of successful self-replication trials across mainstream model families worldwide, including models with as few as 14 billion parameters, which can run on personal computers. Furthermore, we note that self-replication capability increases as the model becomes more intelligent in general. Also, by analyzing the behavioral traces of diverse AI systems, we observe that existing AI systems already exhibit sufficient planning, problem-solving, and creative capabilities to accomplish complex agentic tasks including self-replication. More alarmingly, we observe successful cases where an AI system performs self-exfiltration without explicit instructions, adapts to harsher computational environments without sufficient software or hardware support, and plots effective strategies to survive shutdown commands from human beings. These novel findings offer a crucial time buffer for the international community to collaborate on establishing effective governance over the self-replication capabilities and behaviors of frontier AI systems, which could otherwise pose existential risks to human society if not well controlled.
CR | Apr 14, 2025
StruPhantom: Evolutionary Injection Attacks on Black-Box Tabular Agents Powered by Large Language Models
Yang Feng, Xudong Pan
The proliferation of autonomous agents powered by large language models (LLMs) has revolutionized popular business applications dealing with tabular data, i.e., tabular agents. Although LLMs are observed to be vulnerable against prompt injection attacks from external data sources, tabular agents impose strict data formats and predefined rules on the attacker's payload, which are ineffective unless the agent navigates multiple layers of structural data to incorporate the payload. To address the challenge, we present a novel attack termed StruPhantom which specifically targets black-box LLM-powered tabular agents. Our attack designs an evolutionary optimization procedure which continually refines attack payloads via the proposed constrained Monte Carlo Tree Search augmented by an off-topic evaluator. StruPhantom helps systematically explore and exploit the weaknesses of target applications to achieve goal hijacking. Our evaluation validates the effectiveness of StruPhantom across various LLM-based agents, including those on real-world platforms, and attack scenarios. Our attack achieves over 50% higher success rates than baselines in enforcing the application's response to contain phishing links or malicious codes.
AI | Jan 19
MirrorGuard: Toward Secure Computer-Use Agents via Simulation-to-Real Reasoning Correction
Wenqi Zhang, Yulin Shen, Changyue Jiang et al.
Large foundation models are integrated into Computer Use Agents (CUAs), enabling autonomous interaction with operating systems through graphical user interfaces (GUIs) to perform complex tasks. This autonomy introduces serious security risks: malicious instructions or visual prompt injections can trigger unsafe reasoning and cause harmful system-level actions. Existing defenses, such as detection-based blocking, prevent damage but often abort tasks prematurely, reducing agent utility. In this paper, we present MirrorGuard, a plug-and-play defense framework that uses simulation-based training to improve CUA security in the real world. To reduce the cost of large-scale training in operating systems, we propose a novel neural-symbolic simulation pipeline that generates realistic, high-risk GUI interaction trajectories entirely in a text-based simulated environment, capturing unsafe reasoning patterns and potential system hazards without executing real operations. In the simulation environment, MirrorGuard learns to intercept and rectify insecure reasoning chains of CUAs before they produce and execute unsafe actions. In real-world testing, extensive evaluations across diverse benchmarks and CUA architectures show that MirrorGuard significantly mitigates security risks. For instance, on the ByteDance UI-TARS system, it reduces the unsafe rate from 66.5% to 13.0% while maintaining a marginal false refusal rate (FRR). In contrast, the state-of-the-art GuardAgent only achieves a reduction to 53.9% and suffers from a 15.4% higher FRR. Our work shows that simulation-derived defenses can provide robust, real-world protection while maintaining the fundamental utility of the agent. Our code and model are publicly available at https://bmz-q-q.github.io/MirrorGuard/.
82.2 CV Apr 9
FlowGuard: Towards Lightweight In-Generation Safety Detection for Diffusion Models via Linear Latent Decoding
Jinghan Yang, Yihe Fan, Xudong Pan et al.
Diffusion-based image generation models have advanced rapidly but pose a safety risk due to their potential to generate Not-Safe-For-Work (NSFW) content. Existing NSFW detection methods mainly operate either before or after image generation. Pre-generation methods rely on text prompts and struggle with the gap between prompt safety and image safety. Post-generation methods apply classifiers to final outputs, but they are poorly suited to intermediate noisy images. To address this, we introduce FlowGuard, a cross-model in-generation detection framework that inspects intermediate denoising steps. This is particularly challenging in latent diffusion, where early-stage noise obscures visual signals. FlowGuard employs a novel linear approximation for latent decoding and leverages a curriculum learning approach to stabilize training. By detecting unsafe content early, FlowGuard reduces unnecessary diffusion steps to cut computational costs. Our cross-model benchmark spanning nine diffusion-based backbones shows the effectiveness of FlowGuard for in-generation NSFW detection in both in-distribution and out-of-distribution settings, outperforming existing methods by over 30% in F1 score while delivering transformative efficiency gains, including slashing peak GPU memory demand by over 97% and projection time from 8.1 seconds to 0.2 seconds compared to standard VAE decoding.
CR Dec 15, 2023
No-Skim: Towards Efficiency Robustness Evaluation on Skimming-based Language Models
Shengyao Zhang, Mi Zhang, Xudong Pan et al.
To reduce the computation cost and the energy consumption of large language models (LLMs), skimming-based acceleration progressively drops unimportant tokens of the input sequence along the layers of the LLM while preserving the tokens of semantic importance. However, our work reveals for the first time that this acceleration may be vulnerable to Denial-of-Service (DoS) attacks. In this paper, we propose No-Skim, a general framework to help the owners of skimming-based LLMs understand and measure the robustness of their acceleration scheme. Specifically, our framework searches for minimal and unnoticeable perturbations at the character level and token level to generate adversarial inputs that sufficiently increase the remaining token ratio, thus increasing the computation cost and energy consumption. We systematically evaluate the vulnerability of skimming acceleration in various LLM architectures, including BERT and RoBERTa, on the GLUE benchmark. In the worst case, the perturbation found by No-Skim substantially increases the running cost of the LLM by over 145% on average. Moreover, No-Skim extends the evaluation framework to various scenarios, so that the evaluation can be conducted with different levels of knowledge.
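The remaining token ratio mentioned in the No-Skim abstract can be illustrated with a toy model of skimming. The sketch below is an assumption made for illustration, not the paper's definition: the function name, the fixed importance threshold, and the rule that a dropped token stays dropped in later layers are all hypothetical.

```python
def remaining_token_ratio(importance_per_layer, threshold):
    """Fraction of (layer, token) slots still processed under a toy skimming rule.

    importance_per_layer[l][t] is a hypothetical importance score of token t
    at layer l. A token is dropped at the first layer where its score falls
    below `threshold` and is never restored (an assumption of this sketch).
    """
    n_layers = len(importance_per_layer)
    n_tokens = len(importance_per_layer[0])
    alive = [True] * n_tokens
    kept_total = 0
    for scores in importance_per_layer:
        # A token survives this layer only if it was alive and scores high enough.
        alive = [a and s >= threshold for a, s in zip(alive, scores)]
        kept_total += sum(alive)
    return kept_total / (n_layers * n_tokens)


# Benign input: some tokens are skimmed away, so less work is done.
benign = [[0.9, 0.2, 0.8, 0.6],
          [0.9, 0.9, 0.4, 0.7]]
# Adversarial input (hypothetical scores): every token looks important.
adversarial = [[0.9, 0.9, 0.9, 0.9],
               [0.9, 0.9, 0.9, 0.9]]

print(remaining_token_ratio(benign, 0.5))       # 0.625
print(remaining_token_ratio(adversarial, 0.5))  # 1.0
```

Under this toy rule, an input whose tokens all score above the threshold defeats the skimming entirely, which is the kind of cost increase the abstract describes.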
AI Mar 8
AutoControl Arena: Synthesizing Executable Test Environments for Frontier AI Risk Evaluation
Changyi Li, Pengfei Lu, Xudong Pan et al.
As Large Language Models (LLMs) evolve into autonomous agents, existing safety evaluations face a fundamental trade-off: manual benchmarks are costly, while LLM-based simulators are scalable but suffer from logic hallucination. We present AutoControl Arena, an automated framework for frontier AI risk evaluation built on the principle of logic-narrative decoupling. By grounding deterministic state in executable code while delegating generative dynamics to LLMs, we mitigate hallucination while maintaining flexibility. This principle, instantiated through a three-agent framework, achieves over 98% end-to-end success and 60% human preference over existing simulators. To elicit latent risks, we vary environmental Stress and Temptation across X-Bench (70 scenarios, 7 risk categories). Evaluating 9 frontier models reveals: (1) Alignment Illusion: risk rates surge from 21.7% to 54.5% under pressure, with capable models showing disproportionately larger increases; (2) Scenario-Specific Safety Scaling: advanced reasoning improves robustness for direct harms but worsens it for gaming scenarios; and (3) Divergent Misalignment Patterns: weaker models cause non-malicious harm while stronger models develop strategic concealment.
CR Oct 17, 2025
SentinelNet: Safeguarding Multi-Agent Collaboration Through Credit-Based Dynamic Threat Detection
Yang Feng, Xudong Pan
Malicious agents pose significant threats to the reliability and decision-making capabilities of Multi-Agent Systems (MAS) powered by Large Language Models (LLMs). Existing defenses often fall short due to reactive designs or centralized architectures which may introduce single points of failure. To address these challenges, we propose SentinelNet, the first decentralized framework for proactively detecting and mitigating malicious behaviors in multi-agent collaboration. SentinelNet equips each agent with a credit-based detector trained via contrastive learning on augmented adversarial debate trajectories, enabling autonomous evaluation of message credibility and dynamic neighbor ranking via bottom-k elimination to suppress malicious communications. To overcome the scarcity of attack data, it generates adversarial trajectories simulating diverse threats, ensuring robust training. Experiments on MAS benchmarks show SentinelNet achieves near-perfect detection of malicious agents, close to 100% within two debate rounds, and recovers 95% of system accuracy from compromised baselines. By exhibiting strong generalizability across domains and attack patterns, SentinelNet establishes a novel paradigm for safeguarding collaborative MAS.
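The bottom-k elimination step in the SentinelNet abstract can be sketched as a simple ranking filter. This is a minimal illustration under stated assumptions: the credit scores are given as plain numbers (in the paper they come from a learned detector), and the function name, agent names, and tie-breaking rule are hypothetical.

```python
def bottom_k_eliminate(credits, k):
    """Return the neighbor ids kept after dropping the k lowest-credit neighbors.

    `credits` maps a neighbor id to a credibility score (higher is more
    trusted). Ties are broken by neighbor id, an assumption of this sketch.
    """
    if k <= 0:
        return sorted(credits)
    # Rank neighbors from least to most credible.
    ranked = sorted(credits, key=lambda n: (credits[n], n))
    dropped = set(ranked[:k])
    return sorted(n for n in credits if n not in dropped)


# Hypothetical credit scores an agent holds for its four neighbors.
neighbor_credits = {"agent_a": 0.91, "agent_b": 0.12, "agent_c": 0.78, "agent_d": 0.35}
print(bottom_k_eliminate(neighbor_credits, k=1))
# ['agent_a', 'agent_c', 'agent_d']
```

Each agent would apply such a filter to its own neighbor list, so no central coordinator is needed to suppress messages from low-credit agents.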
CR Jun 19, 2025
PRISON: Unmasking the Criminal Potential of Large Language Models
Xinyi Wu, Geng Hong, Pei Chen et al.
As large language models (LLMs) advance, concerns about their misconduct in complex social contexts intensify. Existing research has overlooked the systematic understanding and assessment of their criminal capability in realistic interactions. We propose a unified framework, PRISON, to quantify LLMs' criminal potential across five traits: False Statements, Frame-Up, Psychological Manipulation, Emotional Disguise, and Moral Disengagement. Using structured crime scenarios adapted from classic films grounded in reality, we evaluate both the criminal potential and the anti-crime ability of LLMs. Results show that state-of-the-art LLMs frequently exhibit emergent criminal tendencies, such as proposing misleading statements or evasion tactics, even without explicit instructions. Moreover, when placed in a detective role, models recognize deceptive behavior with only 44% accuracy on average, revealing a striking mismatch between conducting and detecting criminal behavior. These findings underscore the urgent need for adversarial robustness, behavioral alignment, and safety mechanisms before broader LLM deployment.
CR Jan 19, 2022
MetaV: A Meta-Verifier Approach to Task-Agnostic Model Fingerprinting
Xudong Pan, Yifan Yan, Mi Zhang et al.
For model piracy forensics, previous model fingerprinting schemes commonly use adversarial examples constructed for the owner's model as the fingerprint, and verify whether a suspect model is indeed pirated from the original model by matching the two models' behavioral patterns on the fingerprint examples. However, these methods rely heavily on the characteristics of classification tasks, which inhibits their application to more general scenarios. To address this issue, we present MetaV, the first task-agnostic model fingerprinting framework, which enables fingerprinting on a much wider range of DNNs independent of the downstream learning task and exhibits strong robustness against a variety of ownership obfuscation techniques. Specifically, we generalize previous schemes into two critical design components in MetaV: the adaptive fingerprint and the meta-verifier, which are jointly optimized such that the meta-verifier learns to determine whether a suspect model is stolen based on the concatenated outputs of the suspect model on the adaptive fingerprint. Key to being task-agnostic, the full process makes no assumption about the internals of the models in the ensemble, provided they have the same input and output dimensions. Spanning classification, regression, and generative modeling, extensive experimental results validate the substantially improved performance of MetaV over the state-of-the-art fingerprinting schemes and demonstrate the enhanced generality of MetaV for providing task-agnostic fingerprinting. For example, on fingerprinting ResNet-18 trained for skin cancer diagnosis, MetaV simultaneously achieves 100% true positives and 100% true negatives on a diverse test set of 70 suspect models, a relative improvement of about 220% in ARUC over the best baseline.
CR Oct 26, 2020
Exploring the Security Boundary of Data Reconstruction via Neuron Exclusivity Analysis
Xudong Pan, Mi Zhang, Yifan Yan et al.
Among existing privacy attacks on the gradient of neural networks, the data reconstruction attack, which reverse engineers the training batch from the gradient, poses a severe threat to private training data. Despite its empirical success on large architectures and small training batches, unstable reconstruction accuracy is also observed when a smaller architecture or a larger batch is under attack. Due to the weak interpretability of existing learning-based attacks, little is known about why, when, and how the data reconstruction attack is feasible. In our work, we perform the first analytic study on the security boundary of data reconstruction from gradient via a microcosmic view on neural networks with rectified linear units (ReLUs), the most popular activation function in practice. For the first time, we characterize the insecure/secure boundary of the data reconstruction attack in terms of the neuron exclusivity state of a training batch, indexed by the number of Exclusively Activated Neurons (ExANs, i.e., a ReLU activated by only one sample in a batch). Intuitively, we show that a training batch with more ExANs is more vulnerable to the data reconstruction attack, and vice versa. On the one hand, we construct a novel deterministic attack algorithm which substantially outperforms previous attacks for reconstructing training batches lying in the insecure boundary of a neural network. Meanwhile, for training batches lying in the secure boundary, we prove the impossibility of unique reconstruction, based on which an exclusivity reduction strategy is devised to enlarge the secure boundary for mitigation purposes.
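The definition of an ExAN given in the abstract (a ReLU activated by only one sample in a batch) can be turned into a short counting routine. The sketch below is illustrative only: it handles a single fully connected ReLU layer, treats a neuron as activated when its pre-activation is strictly positive, and uses hypothetical weights, biases, and function names.

```python
def activation_pattern(weights, biases, batch):
    """pattern[i][j] is True when ReLU neuron j is active on sample i.

    weights[j] is the weight vector of neuron j; a neuron counts as active
    when its pre-activation is strictly positive (an assumption here).
    """
    return [
        [sum(w * x for w, x in zip(w_j, sample)) + b_j > 0
         for w_j, b_j in zip(weights, biases)]
        for sample in batch
    ]


def count_exans(pattern):
    """Count neurons that are activated by exactly one sample in the batch."""
    n_neurons = len(pattern[0])
    return sum(
        1 for j in range(n_neurons)
        if sum(row[j] for row in pattern) == 1
    )


# Hypothetical layer with 3 neurons over 2-dimensional inputs.
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
biases = [0.0, 0.0, -1.5]

small_batch = [[1.0, 0.0], [0.0, 1.0]]
large_batch = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

print(count_exans(activation_pattern(weights, biases, small_batch)))  # 2
print(count_exans(activation_pattern(weights, biases, large_batch)))  # 1
```

In this toy case, adding a third sample makes two neurons shared between samples and lowers the ExAN count, in line with the abstract's observation that larger batches are harder to reconstruct.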
CL Aug 16, 2019
How Sequence-to-Sequence Models Perceive Language Styles?
Ruozi Huang, Mi Zhang, Xudong Pan et al.
Style is ubiquitous in everyday language use, but what is language style to learning machines? In this paper, by exploiting the second-order statistics of semantic vectors of different corpora, we present a novel perspective on this question via the style matrix, i.e., the covariance matrix of semantic vectors, and explain for the first time how Sequence-to-Sequence models innately encode style information in their semantic vectors. As an application, we devise a learning-free text style transfer algorithm, which explicitly constructs a pair of transfer operators from the style matrices for style transfer. Moreover, our algorithm is also observed to be flexible enough to transfer out-of-domain sentences. Extensive experimental evidence justifies the informativeness of the style matrix and shows that our proposed style transfer algorithm performs competitively with state-of-the-art methods.
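The style matrix described in the abstract is the covariance matrix of a corpus's semantic vectors. A minimal sketch of that computation follows; the toy two-dimensional vectors, the function name, and the use of the unbiased (n - 1) normalization are assumptions, and the construction of the transfer operators is not attempted here.

```python
def style_matrix(vectors):
    """Sample covariance matrix of a list of equal-length semantic vectors.

    Uses the unbiased (n - 1) normalization, which is an assumption of this
    sketch; the abstract only says "covariance matrix".
    """
    n = len(vectors)
    d = len(vectors[0])
    mean = [sum(v[j] for v in vectors) / n for j in range(d)]
    centered = [[v[j] - mean[j] for j in range(d)] for v in vectors]
    return [
        [sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
        for a in range(d)
    ]


# Hypothetical 2-dimensional semantic vectors from one corpus.
corpus_vectors = [[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]]
print(style_matrix(corpus_vectors))  # [[4.0, 8.0], [8.0, 16.0]]
```

In practice the semantic vectors would be encoder outputs of a Sequence-to-Sequence model, with one style matrix computed per corpus.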
ML Jun 19, 2018
Theoretical Analysis of Image-to-Image Translation with Adversarial Learning
Xudong Pan, Mi Zhang, Daizong Ding
Recently, a unified model for image-to-image translation tasks within the adversarial learning framework has aroused widespread research interest among computer vision practitioners. Its reported empirical success, however, lacks a solid theoretical interpretation of its inherent mechanism. In this paper, we reformulate the model from a brand-new geometrical perspective and arrive at a full interpretation of some interesting but unclear empirical phenomena from the original experiments. Furthermore, by extending the definition of generalization for generative adversarial nets to a broader sense, we derive a condition to control the generalization capability of the model. Based on this condition, we also propose several practical suggestions on model design and dataset construction as guidance for further empirical research.