LG Apr 29
Dynamic Adversarial Fine-Tuning Reorganizes Refusal Geometry
Wenhao Lan, Shan Li, Junbin Yang et al.
Safety-aligned language models must refuse harmful requests without collapsing into broad over-refusal, but the training-time mechanisms behind this tradeoff remain unclear. Prior work characterizes refusal directions and jailbreak robustness, yet does not explain how dynamic adversarial fine-tuning changes refusal carriers across training. We present a measurement-driven mechanism study, not a new defense, on one 7B backbone under supervised fine-tuning (SFT) and R2D2-style dynamic adversarial fine-tuning. Our protocol aligns fixed-source HarmBench, StrongREJECT, and XSTest with a five-anchor refusal-geometry suite and causal interventions. R2D2 drives fixed-source HarmBench ASR to 0.000 at steps 50 and 100, then partially reopens to 0.035 at step 250 and 0.250 at step 500; SFT remains less robust, with ASR between 0.505 and 0.588 at the same anchors. On XSTest, R2D2 any-refusal is 1.000 early, then falls to 0.664 and 0.228. Geometrically, R2D2 preserves a late-layer admissible carrier through step 100 before relocating to an early-layer carrier, while effective rank remains near 1.23--1.27. Causal interventions indicate low-dimensional but utility-coupled control. These results support a reorganization account rather than a drift-only account, with evidence limited to one backbone and fixed-source attacks.
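The abstract reports an effective rank near 1.23 to 1.27 but does not say which definition it uses. As a minimal sketch, assuming the common entropy-based definition (the exponential of the Shannon entropy of the normalized singular values), the quantity could be computed as follows. The function name and the input format are illustrative assumptions, not the authors' code.

```python
import math


def effective_rank(singular_values):
    """Entropy-based effective rank of a spectrum (assumed definition).

    Normalizes the singular values into a probability distribution and
    returns exp(Shannon entropy). A spectrum dominated by one value gives
    a result close to 1; k equal values give exactly k.
    """
    total = sum(singular_values)
    # Zero singular values contribute nothing to the entropy, so skip them.
    probs = [s / total for s in singular_values if s > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)
```

Under this definition, a value slightly above 1 means that one direction carries most of the spectrum, which is consistent with the abstract's description of low-dimensional control.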
CV Apr 22, 2024
CloudFort: Enhancing Robustness of 3D Point Cloud Classification Against Backdoor Attacks via Spatial Partitioning and Ensemble Prediction
Wenhao Lan, Yijun Yang, Haihua Shen et al.
The increasing adoption of 3D point cloud data in various applications, such as autonomous vehicles, robotics, and virtual reality, has brought about significant advancements in object recognition and scene understanding. However, this progress is accompanied by new security challenges, particularly in the form of backdoor attacks. These attacks involve inserting malicious information into the training data of machine learning models, potentially compromising the model's behavior. In this paper, we propose CloudFort, a novel defense mechanism designed to enhance the robustness of 3D point cloud classifiers against backdoor attacks. CloudFort leverages spatial partitioning and ensemble prediction techniques to effectively mitigate the impact of backdoor triggers while preserving the model's performance on clean data. We evaluate the effectiveness of CloudFort through extensive experiments, demonstrating its strong resilience against the Point Cloud Backdoor Attack (PCBA). Our results show that CloudFort significantly enhances the security of 3D point cloud classification models without compromising their accuracy on benign samples. Furthermore, we explore the limitations of CloudFort and discuss potential avenues for future research in the field of 3D point cloud security. The proposed defense mechanism represents a significant step towards ensuring the trustworthiness and reliability of point-cloud-based systems in real-world applications.
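The general idea of combining spatial partitioning with ensemble prediction can be sketched on a toy example. The abstract does not specify how CloudFort partitions a cloud or aggregates predictions, so the octant split around the centroid, the majority vote, and the `classify` callback below are all assumptions made for illustration.

```python
from collections import Counter


def partition_octants(points):
    """Split a point cloud into up to 8 octants around its centroid (assumed scheme)."""
    n = len(points)
    cx = sum(p[0] for p in points) / n
    cy = sum(p[1] for p in points) / n
    cz = sum(p[2] for p in points) / n
    parts = {}
    for p in points:
        key = (p[0] >= cx, p[1] >= cy, p[2] >= cz)
        parts.setdefault(key, []).append(p)
    return list(parts.values())


def ensemble_predict(points, classify):
    """Classify every partition separately and return the majority label.

    `classify` is a hypothetical callable mapping a list of points to a label.
    """
    votes = Counter(classify(part) for part in partition_octants(points))
    return votes.most_common(1)[0][0]


def toy_classifier(points):
    """Stand-in model: any far-away point acts as a backdoor trigger."""
    return "poisoned" if any(p[0] > 5 for p in points) else "clean"


# Eight cube corners plus one trigger point far from the object.
cloud = [(x, y, z) for x in (-2, 2) for y in (-2, 2) for z in (-2, 2)]
cloud.append((10, 10, 10))
```

In this toy setup the trigger falls into a single partition, so it can flip at most one vote, while the stand-in classifier applied to the whole cloud is fooled.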
IT Jun 10, 2019
HTDet: A Clustering Method using Information Entropy for Hardware Trojan Detection
Renjie Lu, Haihua Shen, Feng Zhang et al.
Hardware Trojans (HTs) have drawn increasing attention in both academia and industry because of the significant threat they pose. In this paper, we propose HTDet, a novel HT detection method based on information-entropy clustering. The key insight of HTDet is that a Trojan is usually inserted in regions with low controllability and low observability in order to stay concealed, so Trojan logic exhibits extremely few transitions during simulation. Logic regions with few transitions therefore carry the most useful information for HT detection. Accordingly, HTDet applies information theory and a typical density-based clustering algorithm, Density-Based Spatial Clustering of Applications with Noise (DBSCAN), to detect all suspicious Trojan logic in the circuit under detection (CUD). DBSCAN is an unsupervised learning algorithm, which improves the applicability of HTDet. In addition, we develop a heuristic test-pattern generation method that uses mutual information to increase the transitions of suspicious Trojan logic. Experimental evaluation on benchmarks demonstrates the effectiveness of HTDet.
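The pipeline described above (an information-theoretic feature per net, followed by density-based clustering) can be sketched as follows. The abstract does not give HTDet's exact feature, so the smoothed toggle surprisal used here is an assumption, and `dbscan_1d` is a minimal textbook DBSCAN on scalars written for illustration, not the authors' implementation.

```python
import math


def toggle_surprisal(trace):
    """Surprisal (bits) of a toggle on one net, from a simulated 0/1 trace.

    Assumed feature: -log2 of the Laplace-smoothed toggle probability, so
    nets that almost never switch receive a high value.
    """
    cycles = len(trace) - 1
    toggles = sum(a != b for a, b in zip(trace, trace[1:]))
    p = (toggles + 1) / (cycles + 2)
    return -math.log2(p)


def dbscan_1d(values, eps, min_pts):
    """Minimal DBSCAN on scalars; returns one label per value, -1 for noise."""
    labels = [None] * len(values)

    def neighbours(i):
        return [j for j, v in enumerate(values) if abs(v - values[i]) <= eps]

    cluster = -1
    for i in range(len(values)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisional noise, may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point is a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # j is a core point, keep expanding
    return labels


# Hypothetical traces: three ordinary nets and two nets that never toggle.
traces = [
    [0, 0, 1, 1, 0, 0, 1, 1, 0],
    [1, 1, 0, 0, 1, 1, 0, 0, 1],
    [0, 1, 1, 0, 1, 1, 0, 0, 1],
    [0] * 9,
    [1] * 9,
]
features = [toggle_surprisal(t) for t in traces]
labels = dbscan_1d(features, eps=0.5, min_pts=2)
```

With these toy traces, the two silent nets end up in their own cluster, separate from the ordinary nets. The `eps` and `min_pts` values are arbitrary choices for this example.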
AI Nov 3, 2016
Extracting Actionability from Machine Learning Models by Sub-optimal Deterministic Planning
Qiang Lyu, Yixin Chen, Zhaorong Li et al.
A main focus of machine learning research has been improving the generalization accuracy and efficiency of prediction models. Many models, such as SVMs, random forests, and deep neural nets, have been proposed and have achieved great success. However, what is missing in many applications is actionability, i.e., the ability to turn prediction results into actions. For example, in applications such as customer relationship management, clinical prediction, and advertisement, users need not only accurate predictions but also actionable instructions that can move an input toward a desirable goal (e.g., higher profits, lower morbidity rates, higher ad hit rates). Existing efforts to derive such actionable knowledge are few and limited to simple action models in which each action changes only one attribute. The dilemma is that in many real applications the action models are more complex, and an optimal solution is harder to extract. In this paper, we propose a novel approach that achieves actionability by combining learning with planning, two core areas of AI. In particular, we propose a framework to extract actionable knowledge from random forest, one of the most widely used and best off-the-shelf classifiers. We formulate the actionability problem as a sub-optimal action planning (SOAP) problem, which is to find a plan that alters certain features of a given input so that the random forest yields a desirable output, while minimizing the total cost of the actions. Technically, the SOAP problem is formulated in the SAS+ planning formalism and solved using a Max-SAT based approach. Our experimental results demonstrate the effectiveness and efficiency of the proposed approach on a personal credit dataset and other benchmarks. Our work represents a new application of automated planning to an emerging and challenging machine learning paradigm.
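The planning problem itself (alter features at a cost until the forest votes for the desired class) can be illustrated on a toy instance. The sketch below uses a plain uniform-cost search in place of the SAS+ encoding and Max-SAT solver named in the abstract, and the three-tree "forest", the features, and the action costs are all invented for the example.

```python
import heapq
from itertools import count


def forest_predict(trees, x):
    """Majority vote of binary trees, each a callable returning 0 or 1."""
    votes = sum(tree(x) for tree in trees)
    return 1 if 2 * votes > len(trees) else 0


def cheapest_plan(trees, x0, actions, target=1):
    """Uniform-cost search for the cheapest action sequence reaching `target`.

    `actions` is a list of (name, cost, feature, new_value) tuples.
    Returns (total_cost, [action names]) or None if no plan exists.
    """
    tie = count()  # tie-breaker so the heap never compares states
    start = tuple(sorted(x0.items()))
    frontier = [(0.0, next(tie), start, [])]
    seen = set()
    while frontier:
        cost, _, state, plan = heapq.heappop(frontier)
        if state in seen:
            continue
        seen.add(state)
        x = dict(state)
        if forest_predict(trees, x) == target:
            return cost, plan
        for name, step_cost, feature, value in actions:
            if x[feature] == value:
                continue  # action would change nothing
            nxt = dict(x)
            nxt[feature] = value
            heapq.heappush(
                frontier,
                (cost + step_cost, next(tie), tuple(sorted(nxt.items())), plan + [name]),
            )
    return None


# Hypothetical credit example: three single-feature stumps stand in for a forest.
trees = [
    lambda x: 1 if x["income"] == "high" else 0,
    lambda x: 1 if x["debt"] == "low" else 0,
    lambda x: 1 if x["has_card"] else 0,
]
x0 = {"income": "low", "debt": "high", "has_card": False}
actions = [
    ("raise_income", 5.0, "income", "high"),
    ("pay_debt", 2.0, "debt", "low"),
    ("open_card", 1.0, "has_card", True),
]
result = cheapest_plan(trees, x0, actions)
```

Exhaustive search like this only scales to tiny feature spaces, which is one reason a planning formulation with a dedicated solver, as the abstract describes, is attractive for real forests.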