CRMay 7Code
AgentDyn: Are Your Agent Security Defenses Deployable in Real-World Dynamic Environments?Hao Li, Ruoyao Wen, Shanghao Shi et al.
AI agents that autonomously interact with external tools and environments have shown great promise across real-world applications. However, their reliance on external data exposes them to serious indirect prompt injection attacks, where malicious instructions embedded in third-party content hijack agent behaviors. To mitigate this threat, a growing number of defenses have been proposed and evaluated under existing agent security benchmarks. These benchmarks provide structured environments for comparing attacks and defenses, and have become a key driver for defense design and optimization. However, as agents move toward more complex and open-ended real-world deployments, there is a pressing need for benchmarks to become more adaptive and better reflect the dynamic environments faced by real-world agentic systems. In this work, we reveal three fundamental flaws in the current benchmarks and push the frontier along these dimensions: (i) lack of dynamic open-ended tasks, (ii) lack of helpful instructions, and (iii) simplistic user tasks. To bridge this gap, we introduce AgentDyn, a manually designed benchmark featuring 60 challenging open-ended tasks and 560 injection test cases across Shopping, GitHub, and Daily Life. Unlike prior static benchmarks, AgentDyn requires dynamic planning and incorporates helpful third-party instructions. Our evaluation of ten state-of-the-art defenses suggests that almost all existing defenses are either not secure enough or suffer from significant over-defense, revealing that existing defenses are still far from real-world deployment. Our benchmark is available at https://github.com/leolee99/AgentDyn.
LGNov 10, 2023Code
Scale-MIA: A Scalable Model Inversion Attack against Secure Federated Learning via Latent Space ReconstructionShanghao Shi, Ning Wang, Yang Xiao et al.
Federated learning is known for its capability to safeguard the participants' data privacy. However, recently emerged model inversion attacks (MIAs) have shown that a malicious parameter server can reconstruct individual users' local data samples from model updates. The state-of-the-art attacks either rely on computation-intensive iterative optimization methods to reconstruct each input batch, making scaling difficult, or involve the malicious parameter server adding extra modules before the global model architecture, rendering the attacks too conspicuous and easily detectable. To overcome these limitations, we propose Scale-MIA, a novel MIA capable of efficiently and accurately reconstructing local training samples from the aggregated model updates, even when the system is protected by a robust secure aggregation (SA) protocol. Scale-MIA utilizes the inner architecture of models and identifies the latent space as the critical layer for breaching privacy. Scale-MIA decomposes the complex reconstruction task into an innovative two-step process. The first step is to reconstruct the latent space representations (LSRs) from the aggregated model updates using a closed-form inversion mechanism, leveraging specially crafted linear layers. Then in the second step, the LSRs are fed into a fine-tuned generative decoder to reconstruct the whole input batch. We implemented Scale-MIA on commonly used machine learning models and conducted comprehensive experiments across various settings. The results demonstrate that Scale-MIA achieves excellent performance on different datasets, exhibiting high reconstruction rates, accuracy, and attack efficiency on a larger scale compared to state-of-the-art MIAs. Our code is available at https://github.com/unknown123489/Scale-MIA.
CRDec 12, 2025Code
Rethinking Jailbreak Detection of Large Vision Language Models with Representational Contrastive ScoringPeichun Hua, Hao Li, Shanghao Shi et al.
Large Vision-Language Models (LVLMs) are vulnerable to a growing array of multimodal jailbreak attacks, necessitating defenses that are both generalizable to novel threats and efficient for practical deployment. Many current strategies fall short, either targeting specific attack patterns, which limits generalization, or imposing high computational overhead. While lightweight anomaly-detection methods offer a promising direction, we find that their common one-class design tends to confuse novel benign inputs with malicious ones, leading to unreliable over-rejection. To address this, we propose Representational Contrastive Scoring (RCS), a framework built on a key insight: the most potent safety signals reside within the LVLM's own internal representations. Our approach inspects the internal geometry of these representations, learning a lightweight projection to maximally separate benign and malicious inputs in safety-critical layers. This enables a simple yet powerful contrastive score that differentiates true malicious intent from mere novelty. Our instantiations, MCD (Mahalanobis Contrastive Detection) and KCD (K-nearest Contrastive Detection), achieve state-of-the-art performance on a challenging evaluation protocol designed to test generalization to unseen attack types. This work demonstrates that effective jailbreak detection can be achieved by applying simple, interpretable statistical methods to the appropriate internal representations, offering a practical path towards safer LVLM deployment. Our code is available on Github https://github.com/sarendis56/Jailbreak_Detection_RCS.
LGJul 13, 2024
MedLeak: Multimodal Medical Data Leakage in Secure Federated Learning with Crafted ModelsShanghao Shi, Md Shahedul Haque, Abhijeet Parida et al.
Federated learning (FL) allows participants to collaboratively train machine learning models while keeping their data local, making it ideal for collaborations among healthcare institutions on sensitive data. However, in this paper, we propose a novel privacy attack called MedLeak, which allows a malicious FL server to recover high-quality site-specific private medical data from the client model updates. MedLeak works by introducing an adversarially crafted model during the FL training process. Honest clients, unaware of the insidious changes in the published models, continue to send back their updates as per the standard FL protocol. Leveraging a novel analytical method, MedLeak can efficiently recover private client data from the aggregated parameter updates, eliminating costly optimization. In addition, the scheme relies solely on the aggregated updates, thus rendering secure aggregation protocols ineffective, as they depend on the randomization of intermediate results for security while leaving the final aggregated results unaltered. We implement MedLeak on medical image datasets (MedMNIST, COVIDx CXR-4, and Kaggle Brain Tumor MRI), as well as a medical text dataset (MedAbstract). The results demonstrate that our attack achieves high recovery rates and strong quantitative scores on both image and text datasets. We also thoroughly evaluate MedLeak across different attack parameters, providing insights into key factors that influence attack performance and potential defenses. Furthermore, we demonstrate that the recovered data can support downstream tasks such as disease classification with minimal performance loss. Our findings validate the need for enhanced privacy measures in FL systems, particularly for safeguarding sensitive medical data against powerful model inversion attacks.
LGJul 12, 2024
BoBa: Boosting Backdoor Detection through Data Distribution Inference in Federated LearningNing Wang, Shanghao Shi, Yang Xiao et al.
Federated learning, while being a promising approach for collaborative model training, is susceptible to poisoning attacks due to its decentralized nature. Backdoor attacks, in particular, have shown remarkable stealthiness, as they selectively compromise predictions for inputs containing triggers. Previous endeavors to detect and mitigate such attacks are based on the Independent and Identically Distributed (IID) data assumption where benign model updates exhibit high-level similarity in multiple feature spaces due to IID data. Thus, outliers are detected as backdoor attacks. Nevertheless, non-IID data presents substantial challenges in backdoor attack detection, as the data variety introduces variance among benign models, making outlier detection-based mechanisms less effective. We propose a novel distribution-aware anomaly detection mechanism, BoBa, to address this problem. In order to differentiate outliers arising from data variety versus backdoor attack, we propose to break down the problem into two steps: clustering clients utilizing their data distribution followed by a voting-based detection. Based on the intuition that clustering and subsequent backdoor detection can drastically benefit from knowing client data distributions, we propose a novel data distribution inference mechanism. To improve detection robustness, we introduce an overlapping clustering method, where each client is associated with multiple clusters, ensuring that the trustworthiness of a model update is assessed collectively by multiple clusters rather than a single cluster. Through extensive evaluations, we demonstrate that BoBa can reduce the attack success rate to lower than 0.001 while maintaining high main task accuracy across various attack strategies and experimental settings.
CRMay 3, 2024Code
ProFLingo: A Fingerprinting-based Intellectual Property Protection Scheme for Large Language ModelsHeng Jin, Chaoyu Zhang, Shanghao Shi et al.
Large language models (LLMs) have attracted significant attention in recent years. Due to their "Large" nature, training LLMs from scratch consumes immense computational resources. Since several major players in the artificial intelligence (AI) field have open-sourced their original LLMs, an increasing number of individuals and smaller companies are able to build derivative LLMs based on these open-sourced models at much lower costs. However, this practice opens up possibilities for unauthorized use or reproduction that may not comply with licensing agreements, and fine-tuning can change the model's behavior, thus complicating the determination of model ownership. Current intellectual property (IP) protection schemes for LLMs are either designed for white-box settings or require additional modifications to the original model, which restricts their use in real-world settings. In this paper, we propose ProFLingo, a black-box fingerprinting-based IP protection scheme for LLMs. ProFLingo generates queries that elicit specific responses from an original model, thereby establishing unique fingerprints. Our scheme assesses the effectiveness of these queries on a suspect model to determine whether it has been derived from the original model. ProFLingo offers a non-invasive approach, which neither requires knowledge of the suspect model nor modifications to the base model or its training process. To the best of our knowledge, our method represents the first black-box fingerprinting technique for IP protection for LLMs. Our source code and generated queries are available at: https://github.com/hengvt/ProFLingo.
LGApr 30
Low Rank Adaptation for Adversarial PerturbationHan Liu, Shanghao Shi, Yevgeniy Vorobeychik et al.
Low-Rank Adaptation (LoRA), which leverages the insight that model updates typically reside in a low-dimensional space, has significantly improved the training efficiency of Large Language Models (LLMs) by updating neural network layers using low-rank matrices. Since the generation of adversarial examples is an optimization process analogous to model training, this naturally raises the question: Do adversarial perturbations exhibit a similar low-rank structure? In this paper, we provide both theoretical analysis and extensive empirical investigation across various attack methods, model architectures, and datasets to show that adversarial perturbations indeed possess an inherently low-rank structure. This insight opens up new opportunities for improving both adversarial attacks and defenses. We mainly focus on leveraging this low-rank property to improve the efficiency and effectiveness of black-box adversarial attacks, which often suffer from excessive query requirements. Our method follows a two-step approach. First, we use a reference model and auxiliary data to guide the projection of gradients into a low-dimensional subspace. Next, we confine the perturbation search in black-box attacks to this low-rank subspace, significantly improving the efficiency and effectiveness of the adversarial attacks. We evaluated our approach across a range of attack methods, benchmark models, datasets, and threat models. The results demonstrate substantial and consistent improvements in the performance of our low-rank adversarial attacks compared to conventional methods.
CRAug 30, 2025
Enabling Trustworthy Federated Learning via Remote Attestation for Mitigating Byzantine ThreatsChaoyu Zhang, Heng Jin, Shanghao Shi et al.
Federated Learning (FL) has gained significant attention for its privacy-preserving capabilities, enabling distributed devices to collaboratively train a global model without sharing raw data. However, its distributed nature forces the central server to blindly trust the local training process and aggregate uncertain model updates, making it susceptible to Byzantine attacks from malicious participants, especially in mission-critical scenarios. Detecting such attacks is challenging due to the diverse knowledge across clients, where variations in model updates may stem from benign factors, such as non-IID data, rather than adversarial behavior. Existing data-driven defenses struggle to distinguish malicious updates from natural variations, leading to high false positive rates and poor filtering performance. To address this challenge, we propose Sentinel, a remote attestation (RA)-based scheme for FL systems that regains client-side transparency and mitigates Byzantine attacks from a system security perspective. Our system employs code instrumentation to track control-flow and monitor critical variables in the local training process. Additionally, we utilize a trusted training recorder within a Trusted Execution Environment (TEE) to generate an attestation report, which is cryptographically signed and securely transmitted to the server. Upon verification, the server ensures that legitimate client training processes remain free from program behavior violation or data manipulation, allowing only trusted model updates to be aggregated into the global model. Experimental results on IoT devices demonstrate that Sentinel ensures the trustworthiness of the local training integrity with low runtime and memory overhead.
CRMar 8
Trusting What You Cannot See: Auditable Fine-Tuning and Inference for Proprietary AIHeng Jin, Chaoyu Zhang, Hexuan Yu et al.
Cloud-based infrastructures have become the dominant platform for deploying large models, particularly large language models (LLMs). Fine-tuning and inference are increasingly delegated to cloud providers for simplified deployment and access to proprietary models, yet this creates a fundamental trust gap: although cryptographic and TEE-based verification exist, the scale of modern LLMs renders them prohibitive, leaving clients unable to practically audit these processes. This lack of transparency creates concrete security risks that can silently compromise service integrity. We present AFTUNE, an auditable and verifiable framework that ensures the computation integrity of cloud-based fine-tuning and inference. AFTUNE incorporates a lightweight recording and spot-check mechanism that produces verifiable traces of execution. These traces enable clients to later audit whether the training and inference processes followed the agreed configurations. Our evaluation shows that AFTUNE imposes practical computation overhead while enabling selective and efficient verification, demonstrating that trustworthy model services are achievable in today's cloud environments.