HC · Jun 14, 2023 · Code
Maestro: A Gamified Platform for Teaching AI Robustness
Margarita Geleta, Jiacen Xu, Manikanta Loya et al.
Although the prevention of AI vulnerabilities is critical to preserve the safety and privacy of users and businesses, educational tools for robust AI are still underdeveloped worldwide. We present the design, implementation, and assessment of Maestro. Maestro is an effective open-source game-based platform that contributes to the advancement of robust AI education. Maestro provides goal-based scenarios where college students are exposed to challenging life-inspired assignments in a competitive programming environment. We assessed Maestro's influence on students' engagement, motivation, and learning success in robust AI. This work also provides insights into the design features of online learning tools that promote active learning opportunities in the robust AI domain. We analyzed the reflection responses (measured with Likert scales) of 147 undergraduate students using Maestro in two quarterly college courses in AI. According to the results, students who felt they had acquired new skills in robust AI tended to rate Maestro highly and scored highly on material consolidation, curiosity, and mastery in robust AI. Moreover, the leaderboard, our key gamification element in Maestro, has effectively contributed to students' engagement and learning. Results also indicate that Maestro can be effectively adapted to any course length and depth without losing its educational quality.
CR · Oct 16, 2023
A Comprehensive Study of Privacy Risks in Curriculum Learning
Joann Qiongna Chen, Xinlei He, Zheng Li et al.
Training a machine learning model with data following a meaningful order, i.e., from easy to hard, has been proven to be effective in accelerating the training process and achieving better model performance. The key enabling technique is curriculum learning (CL), which has seen great success and has been deployed in areas like image and text classification. Yet, how CL affects the privacy of machine learning is unclear. Given that CL changes the way a model memorizes the training data, its influence on data privacy needs to be thoroughly evaluated. To fill this knowledge gap, we perform the first study and leverage membership inference attack (MIA) and attribute inference attack (AIA) as two vectors to quantify the privacy leakage caused by CL. Our evaluation on nine real-world datasets with attack methods (NN-based, metric-based, label-only MIA, and NN-based AIA) revealed new insights about CL. First, MIA becomes slightly more effective when CL is applied, but the impact is much more prominent on the subset of training samples ranked as difficult. Second, a model trained under CL is less vulnerable to AIA than to MIA. Third, existing defense techniques like DP-SGD, MemGuard, and MixupMMD are still effective under CL, though DP-SGD has a significant impact on target model accuracy. Finally, based on our insights into CL, we propose a new MIA, termed Diff-Cali, which exploits the difficulty scores for result calibration and is demonstrated to be effective against all CL methods and the normal training method. With this study, we hope to draw the community's attention to the unintended privacy risks of emerging machine-learning techniques and develop new attack benchmarks and defense solutions.
IT · May 21
Information-Theoretic Decentralized Secure Aggregation with User Dropouts
Zhou Li, Xiang Zhang, Yizhou Zhao et al.
This paper investigates the fundamental limits of information-theoretic decentralized secure aggregation (DSA) with user dropouts. We consider a fully decentralized network where $K$ users communicate over broadcast channels without a trusted aggregation server. Each user holds a private input and aims to recover the sum of the surviving users' inputs (users may drop) while ensuring that no additional information about individual inputs is revealed to that user, even if it can collude with other users. A two-round communication protocol is considered, where we assume at least $U$ users survive and each user can collude with at most $T$ other users. For this setting, the optimal communication rate region is fully characterized: we show that DSA is infeasible if $U\le T+1$; otherwise, the optimal rate region is given by $R_1\geq 1$ and $R_2\geq \frac{1}{U-T-1}$, where $R_1$ and $R_2$ denote the first- and second-round communication rates, respectively. The proposed aggregation scheme is based on correlated secret keys constructed from $(T+1)$-private maximum distance separable (MDS) matrices, which simultaneously provide robustness against user dropouts and security against collusion. We also derive tight converse bounds that establish the optimality of the proposed scheme. Our result shows that the optimal second-round communication rate depends only on the effective redundancy level $U-T-1$, regardless of the total number of users.
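The feasibility threshold and rate region stated in this abstract can be evaluated directly. The following is a minimal sketch, not the authors' code: the function name `dsa_dropout_rates` is hypothetical, and it only returns the corner point $(R_1, R_2) = (1, \frac{1}{U-T-1})$ of the region quoted above.

```python
from fractions import Fraction
from typing import Optional, Tuple


def dsa_dropout_rates(U: int, T: int) -> Optional[Tuple[Fraction, Fraction]]:
    """Return the minimum (R1, R2) quoted in the abstract, or None if infeasible.

    U: minimum number of surviving users.
    T: maximum number of users any single user may collude with.
    (Hypothetical helper; it encodes only the stated formulas.)
    """
    if U <= T + 1:
        # The abstract states that DSA is infeasible in this regime.
        return None
    r1 = Fraction(1)                # first-round rate: R1 >= 1
    r2 = Fraction(1, U - T - 1)     # second-round rate: R2 >= 1/(U-T-1)
    return r1, r2


if __name__ == "__main__":
    print(dsa_dropout_rates(5, 2))  # feasible: U - T - 1 = 2
    print(dsa_dropout_rates(3, 2))  # infeasible: U <= T + 1
```

Note that the total number of users $K$ does not appear as a parameter, which mirrors the abstract's closing observation.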
CR · May 20
HIDBench: Benchmarking Large Language Models for Host-Based Intrusion Detection
Danyu Sun, Jinghuai Zhang, Yuan Tian et al.
Recent benchmark efforts have advanced the evaluation of large language models (LLMs) in cybersecurity, including tasks such as penetration testing and vulnerability identification. However, a critical cybersecurity task, namely intrusion detection from system logs, remains unexplored. In this work, we present a new benchmark to assess LLMs' capabilities in supporting host-based intrusion detection systems (HIDS). This task requires fine-grained reasoning over large-scale, noisy, and highly imbalanced system logs, where complex interactions between benign and malicious activities make reliable detection challenging. Our benchmark unifies three public system log datasets, DARPA-E3, DARPA-E5, and NodLink, and introduces a data construction pipeline that transforms raw host telemetry into LLM-compatible inputs, enabling systematic evaluation under realistic intrusion detection settings. Our evaluation of frontier LLMs reveals substantial performance gaps across datasets. While many models achieve high precision (often above 0.8) on simpler datasets, their performance degrades significantly as system logs become noisier and more complex, with MCC frequently dropping below 0.5 and false positive rates increasing sharply. We further analyze model behavior and identify distinct regimes, including conservative detectors with low false positive rates and over-sensitive models that generate excessive alerts. Overall, our results highlight that while LLMs show strong potential for HIDS, their effectiveness is highly sensitive to data complexity, and robust system design is essential for reliable deployment.
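To illustrate why precision above 0.8 can coexist with an MCC below 0.5 on highly imbalanced logs, the sketch below computes precision, false positive rate, and the Matthews correlation coefficient from a confusion matrix. The counts are hypothetical and chosen only for illustration; they are not taken from the benchmark.

```python
from math import sqrt
from typing import Tuple


def detection_metrics(tp: int, fp: int, fn: int, tn: int) -> Tuple[float, float, float]:
    """Return (precision, false positive rate, MCC) for a binary detector."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, MCC is set to 0 when any marginal count is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return precision, fpr, mcc


if __name__ == "__main__":
    # Hypothetical, heavily imbalanced log: 200 malicious events out of 10,000.
    # The detector is precise when it fires but misses most attacks.
    p, fpr, mcc = detection_metrics(tp=40, fp=10, fn=160, tn=9790)
    print(f"precision={p:.2f} fpr={fpr:.4f} mcc={mcc:.2f}")
```

With these made-up counts the precision is 0.8 while the MCC is roughly 0.39, because MCC also penalizes the 160 missed attacks.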
IT · Mar 20
On the Fundamental Limits of Hierarchical Secure Aggregation with Dropout and Collusion Resilience
Zhou Li, Yizhou Zhao, Xiang Zhang et al.
We study the fundamental communication limits of information-theoretic secure aggregation in a hierarchical network consisting of a server, multiple relays, and multiple users per relay. Communication proceeds over two rounds and two hops, and the system is subject to arbitrary user and relay dropouts. Up to $T$ users may collude with either the server or any single relay. The server aims to recover the sum of the inputs of all users that survive the first round, while learning no additional information beyond the aggregate sum and the inputs of the colluding users. Each relay, however, must learn nothing about the users' inputs except for the information revealed by the inputs of the colluding users under the same collusion model. We introduce a four-dimensional rate tuple that captures the communication cost across rounds and hops. Under a delayed message availability model, we establish necessary and sufficient conditions for feasibility and fully characterize the optimal first-round communication rates. For the second round, we characterize the optimal user-to-relay rate and derive lower and upper bounds on the relay-to-server rate. While these bounds do not coincide in general, they are tight in certain regimes of interest. Our results reveal a sharp threshold phenomenon: secure aggregation is feasible if and only if the total number of surviving users across surviving relays exceeds the collusion threshold. Achievability is established via a vector linear coding scheme with carefully structured correlated randomness exhibiting MDS-like properties, ensuring correctness and information-theoretic security under all possible dropout patterns. Entropic converse bounds are also derived.
IT · Mar 22
Information-Theoretic Secure Aggregation in Decentralized Networks
Xiang Zhang, Zhou Li, Shuangyang Li et al.
Motivated by the increasing demand for data security in decentralized federated learning (FL) and stochastic optimization, we formulate and investigate the problem of information-theoretic \emph{decentralized secure aggregation} (DSA). Specifically, we consider a network of $K$ interconnected users, each holding a private input, representing, for example, local model updates in FL, who aim to simultaneously compute the sum of all inputs while satisfying the security requirement that no user, even when colluding with up to $T$ others, learns anything beyond the intended sum. We characterize the optimal rate region, which specifies the minimum achievable communication and secret key rates for DSA. In particular, we show that to securely compute one bit of the desired input sum, each user must (i) transmit at least one bit to all other users, (ii) hold at least one bit of secret key, and (iii) all users must collectively hold no fewer than $K - 1$ independent key bits. Our result establishes the fundamental performance limits of DSA and offers insights into the design of provably secure and communication-efficient protocols for distributed learning systems.
LG · Sep 2, 2025 · Code
ABEX-RAT: Synergizing Abstractive Augmentation and Adversarial Training for Classification of Occupational Accident Reports
Jian Chen, Jiabao Dou, Jinbao Tian et al.
The automatic classification of occupational accident reports is a critical research area for enhancing workplace safety and enabling large-scale risk analysis. However, the severe class imbalance inherent in these real-world datasets often compromises the performance of analytical models, particularly for rare but severe incident types, hindering the development of reliable automated systems. To address this challenge, we propose ABEX-RAT, a novel and efficient framework that synergizes generative data augmentation with robust adversarial training. Our approach first employs a two-step abstractive-expansive (ABEX) pipeline, which leverages a large language model to distill core incident semantics and then uses a generative model to create diverse, high-quality synthetic samples for underrepresented classes. Subsequently, a lightweight classifier is trained on the augmented data using a computationally efficient random adversarial training (RAT) protocol, which stochastically applies perturbations to enhance model generalization and robustness without significant overhead. Experimental results on the public OSHA dataset demonstrate that our method achieves new state-of-the-art performance, reaching a macro-F1 score of 90.32% and significantly outperforming previous SOTA and fine-tuned large model baselines. Our work validates that this synergistic strategy is a highly effective and efficient alternative to brute-force fine-tuning for specialized, imbalanced classification tasks. The code is publicly available at: https://github.com/nxcc-lab/ABEX-RAT.
CL · Aug 10, 2025 · Code
ARCE: Augmented RoBERTa with Contextualized Elucidations for NER in Automated Rule Checking
Jian Chen, Jinbao Tian, Yankui Li et al.
Accurate information extraction from specialized texts is a critical challenge, particularly for named entity recognition (NER) in the architecture, engineering, and construction (AEC) domain to support automated rule checking (ARC). The performance of standard pre-trained models is often constrained by the domain gap, as they struggle to interpret the specialized terminology and complex relational contexts inherent in AEC texts. Although this issue can be mitigated by further pre-training on large, human-curated domain corpora, as exemplified by methods like ARCBERT, this approach is both labor-intensive and cost-prohibitive. Consequently, leveraging large language models (LLMs) for automated knowledge generation has emerged as a promising alternative. However, the optimal strategy for generating knowledge that can genuinely enhance smaller, efficient models remains an open question. To address this, we propose ARCE (augmented RoBERTa with contextualized elucidations), a novel approach that systematically explores and optimizes this generation process. ARCE employs an LLM to first generate a corpus of simple, direct explanations, which we term Cote, and then uses this corpus to incrementally pre-train a RoBERTa model prior to its fine-tuning on the downstream task. Our extensive experiments show that ARCE establishes a new state-of-the-art on a benchmark AEC dataset, achieving a Macro-F1 score of 77.20%. This result also reveals a key finding: simple, explanation-based knowledge proves surprisingly more effective than complex, role-based rationales for this task. The code is publicly available at: https://github.com/nxcc-lab/ARCE.
CR · Mar 2, 2024
AutoAttacker: A Large Language Model Guided System to Implement Automatic Cyber-attacks
Jiacen Xu, Jack W. Stokes, Geoff McDonald et al.
Large language models (LLMs) have demonstrated impressive results on natural language tasks, and security researchers are beginning to employ them in both offensive and defensive systems. In cyber-security, there have been multiple research efforts that utilize LLMs focusing on the pre-breach stage of attacks like phishing and malware generation. However, a comprehensive study is still lacking on whether LLM-based systems can be leveraged to simulate the post-breach stage of attacks that are typically human-operated, or "hands-on-keyboard" attacks, under various attack techniques and environments. As LLMs inevitably advance, they may be able to automate both the pre- and post-breach attack stages. This shift may transform organizational attacks from rare, expert-led events to frequent, automated operations requiring no expertise and executed at automation speed and scale. This risks fundamentally changing global computer security and correspondingly causing substantial economic impacts, and a goal of this work is to better understand these risks now so we can better prepare for these inevitable ever-more-capable LLMs on the horizon. On the immediate impact side, this research serves three purposes. First, an automated LLM-based, post-breach exploitation framework can help analysts quickly test and continually improve their organization's network security posture against previously unseen attacks. Second, an LLM-based penetration test system can extend the effectiveness of red teams with a limited number of human analysts. Finally, this research can help defensive systems and teams learn to detect novel attack behaviors preemptively before their use in the wild....
IT · May 3
Optimal Communication Rate of Secure Aggregation over Ring Networks with Pairwise Keys
Xiang Zhang, Han Yu, Zhou Li et al.
Information-theoretic topological secure aggregation (TSA) \cite{zhang2026information_regular} enables distributed users to compute neighborhood sums over arbitrary networks without revealing individual inputs, while remaining communication-efficient. It has broad applications, including secure model aggregation in decentralized federated learning (FL). Existing TSA formulations rely on arbitrarily correlated keys generated by a trusted key server, which introduces a single point of failure. In this paper, we instead study TSA with \emph{pairwise} secret keys, where each user pair $(i,j)$ shares an independent key $S_{i,j}$. Such keys can be established through inter-user communication, eliminating the need for a key server and improving robustness. Focusing on a ring topology with $K$ users, we characterize the minimum per-user communication rate: \emph{to securely compute one bit of the desired input sum, each user must send at least $1$ bit to its neighbors when $K=3,4$, and at least $2$ bits for all $K\ge 5$}. The higher rate in larger networks arises because each user must simultaneously satisfy two independent key-alignment constraints from its two neighborhoods, which cannot be resolved within a single broadcast symbol under pairwise key independence. We propose a linear pairwise-masking scheme that achieves these rates and prove its optimality via tight entropic converse bounds that exploit the dependency structure of the keys. Notably, for all $K\ge 4$, only a subset of the $\binom{K}{2}$ pairwise keys -- specifically, those between users at ring distance $2$ -- is sufficient to achieve optimality, revealing a nontrivial role of topological sparsity in secure aggregation.
IT · Apr 29
On the Capacity of Hierarchical Secure Aggregation with Groupwise Keys
Minyang Lu, Zhou Li, Haiqiang Chen et al.
We study the hierarchical secure aggregation problem with groupwise keys. The problem consists of an aggregation server, $U$ relays, and $UV$ users, where each relay serves $V$ disjoint users, and each subset of $G$ users shares an independent groupwise key. Two security requirements are imposed: relay security and server security. Specifically, each relay must not learn any information about the users' inputs, and the server must not learn any additional information beyond the recovered sum of all inputs. We first show that the problem is infeasible when $G = 1$. For the feasible regime $1 < G \le UV$, we fully characterize the optimal rate region. In particular, we prove that both each user and each relay must transmit at least one symbol per input symbol. Furthermore, we characterize the minimum required groupwise key rate as $\max\left\{\frac{V}{\binom{UV}{G} - \binom{(U-1)V}{G}},\; \frac{U - 1}{\binom{UV}{G} - U \binom{V}{G}}\right\},$ where the two terms correspond to the constraints imposed by relay security and server security, respectively. For achievability, we propose an explicit linear coding scheme based on structured precoding matrices, and show that it satisfies both correctness and security requirements. The construction avoids permutation-based symmetrization by leveraging sufficiently generic matrix designs over large fields. Finally, we establish a matching converse, thereby characterizing the optimal rate region.
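The key-rate expression in this abstract is easy to evaluate for concrete parameters. Below is a minimal sketch that computes the two terms with exact rational arithmetic; the function name `min_groupwise_key_rate` is hypothetical, and the sketch assumes $U \ge 2$ and $V \ge 1$ so that both denominators are positive in the feasible regime $1 < G \le UV$.

```python
from fractions import Fraction
from math import comb
from typing import Optional


def min_groupwise_key_rate(U: int, V: int, G: int) -> Optional[Fraction]:
    """Evaluate the minimum groupwise key rate formula quoted in the abstract.

    U: number of relays, V: users per relay, G: group size sharing one key.
    Returns None outside the feasible regime 1 < G <= U*V.
    (Hypothetical helper; assumes U >= 2 and V >= 1.)
    """
    if U < 2 or V < 1:
        raise ValueError("this sketch assumes U >= 2 and V >= 1")
    if not (1 < G <= U * V):
        return None  # the abstract states that G = 1 is infeasible
    total = comb(U * V, G)
    # Term attributed to relay security in the abstract.
    relay_term = Fraction(V, total - comb((U - 1) * V, G))
    # Term attributed to server security; math.comb(V, G) is 0 when G > V.
    server_term = Fraction(U - 1, total - U * comb(V, G))
    return max(relay_term, server_term)


if __name__ == "__main__":
    print(min_groupwise_key_rate(U=2, V=2, G=2))
```

For $U=2$, $V=2$, $G=2$ the two terms are $2/5$ and $1/4$, so the relay-security term is the binding one in this small instance.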
IT · Apr 29
Multi-Server Secure Aggregation with Arbitrary Collusion and Heterogeneous Security Constraints
Zhou Li, Xiang Zhang, Jiguang He et al.
We study the fundamental limits of multi-server secure aggregation over a two-hop network where multiple servers, each connected to a disjoint subset of users, jointly compute the sum of all users' inputs. The goal is to ensure that no server can infer any information about prescribed subsets of inputs beyond the desired aggregate, even when colluding with an arbitrary subset of users. Existing works largely focus on homogeneous security requirements, where all inputs are protected against colluding sets up to a given size. Such formulations are insufficient to capture more general scenarios in which different subsets of inputs may require protection against different collusion patterns. In this paper, we consider a general model with heterogeneous security requirements and arbitrary user collusion. We characterize the communication rates for all parameter regimes, and determine the minimum key rate required for secure aggregation in most regimes. In particular, we establish tight information-theoretic lower bounds and matching achievable schemes in a broad class of regimes. For the remaining regime, we derive a general lower bound together with an achievable scheme that attains it within a bounded gap. Our results reveal how the interplay between network topology and heterogeneous security constraints fundamentally determines the communication and key generation requirements, and generalize existing results on secure aggregation.
IT · Mar 6, 2025
Fundamental Limits of Hierarchical Secure Aggregation with Cyclic User Association
Xiang Zhang, Zhou Li, Kai Wan et al.
Secure aggregation is motivated by federated learning (FL) where a cloud server aims to compute an averaged model (i.e., weights of deep neural networks) of the locally-trained models of numerous clients, while adhering to data security requirements. Hierarchical secure aggregation (HSA) extends this concept to a three-layer hierarchical network, where clustered users communicate with the server through an intermediate layer of relays. In HSA, beyond conventional server security, relay security is also enforced to ensure that the relays remain oblivious to the users' inputs (an abstraction of the local models in FL). Existing work on HSA assumes that each user is associated with only one relay, limiting opportunities for coding across inter-cluster users to achieve efficient communication and key generation. In this paper, we consider HSA with a cyclic association pattern where each user is connected to $B$ consecutive relays in a wrap-around manner. We propose an efficient aggregation scheme which includes a message design for the inputs inspired by gradient coding (a well-known technique for efficient communication in distributed computing), along with a highly non-trivial security key design. We also derive novel converse bounds on the minimum achievable communication and key rates using information-theoretic arguments.
CR · May 23, 2025
Dynamic Risk Assessments for Offensive Cybersecurity Agents
Boyi Wei, Benedikt Stroebl, Jiacen Xu et al. · princeton
Foundation models are increasingly becoming better autonomous programmers, raising the prospect that they could also automate dangerous offensive cyber-operations. Current frontier model audits probe the cybersecurity risks of such agents, but most fail to account for the degrees of freedom available to adversaries in the real world. In particular, with strong verifiers and financial incentives, agents for offensive cybersecurity are amenable to iterative improvement by would-be adversaries. We argue that assessments should take into account an expanded threat model in the context of cybersecurity, emphasizing the varying degrees of freedom that an adversary may possess in stateful and non-stateful environments within a fixed compute budget. We show that even with a relatively small compute budget (8 H100 GPU Hours in our study), adversaries can improve an agent's cybersecurity capability on InterCode CTF by more than 40% relative to the baseline -- without any external assistance. These results highlight the need to evaluate agents' cybersecurity risk in a dynamic manner, painting a more representative picture of risk.
IT · Jul 19, 2025
Collusion-Resilient Hierarchical Secure Aggregation with Heterogeneous Security Constraints
Zhou Li, Xiang Zhang, Jiawen Lv et al.
Motivated by federated learning (FL), secure aggregation (SA) aims to securely compute, as efficiently as possible, the sum of a set of inputs distributed across many users. To understand the impact of network topology, hierarchical secure aggregation (HSA) investigated the communication and secret key generation efficiency in a 3-layer relay network, where clusters of users are connected to the aggregation server through an intermediate layer of relays. Due to the pre-aggregation of the messages at the relays, HSA reduces the communication burden on the relay-to-server links and is able to support a large number of users. However, as the number of users increases, a practical challenge arises from heterogeneous security requirements; for example, users in different clusters may require varying levels of input protection. Motivated by this, we study weakly-secure HSA (WS-HSA) with collusion resilience, where instead of protecting all the inputs from any set of colluding users, only the inputs belonging to a predefined collection of user groups (referred to as security input sets) need to be protected against another predefined collection of user groups (referred to as collusion sets). Since the security input sets and collusion sets can be arbitrarily defined, our formulation offers a flexible framework for addressing heterogeneous security requirements in HSA. We characterize the optimal total key rate, i.e., the total number of independent key symbols required to ensure both server and relay security, for a broad range of parameter configurations. For the remaining cases, we establish lower and upper bounds on the optimal key rate, providing constant-factor gap optimality guarantees.
IV · Mar 19, 2025
FetalFlex: Anatomy-Guided Diffusion Model for Flexible Control on Fetal Ultrasound Image Synthesis
Yaofei Duan, Tao Tan, Zhiyuan Zhu et al.
Fetal ultrasound (US) examinations require the acquisition of multiple planes, each providing unique diagnostic information to evaluate fetal development and screen for congenital anomalies. However, obtaining a comprehensive, multi-plane annotated fetal US dataset remains challenging, particularly for rare or complex anomalies owing to their low incidence and numerous subtypes. This poses difficulties in training novice radiologists and developing robust AI models, especially for detecting abnormal fetuses. In this study, we introduce a Flexible Fetal US image generation framework (FetalFlex) to address these challenges, which leverages anatomical structures and multimodal information to enable controllable synthesis of fetal US images across diverse planes. Specifically, FetalFlex incorporates a pre-alignment module to enhance controllability and introduces a repaint strategy to ensure consistent texture and appearance. Moreover, a two-stage adaptive sampling strategy is developed to progressively refine image quality from coarse to fine levels. We believe that FetalFlex is the first method capable of generating both in-distribution normal and out-of-distribution abnormal fetal US images, without requiring any abnormal data. Experiments on multi-center datasets demonstrate that FetalFlex achieved state-of-the-art performance across multiple image quality metrics. A reader study further confirms the close alignment of the generated results with expert visual assessments. Furthermore, synthetic images by FetalFlex significantly improve the performance of six typical deep models in downstream classification and anomaly detection tasks. Lastly, FetalFlex's anatomy-level controllable generation offers a unique advantage for anomaly simulation and creating paired or counterfactual data at the pixel level. The demo is available at: https://dyf1023.github.io/FetalFlex/.
IT · Aug 1, 2025
Information-Theoretic Decentralized Secure Aggregation with Collusion Resilience
Xiang Zhang, Zhou Li, Shuangyang Li et al.
In decentralized federated learning (FL), multiple clients collaboratively learn a shared machine learning (ML) model by leveraging their privately held datasets distributed across the network, through interactive exchange of the intermediate model updates. To ensure data security, cryptographic techniques are commonly employed to protect model updates during aggregation. Despite growing interest in secure aggregation, existing works predominantly focus on protocol design and computational guarantees, with limited understanding of the fundamental information-theoretic limits of such systems. Moreover, optimal bounds on communication and key usage remain unknown in decentralized settings, where no central aggregator is available. Motivated by these gaps, we study the problem of decentralized secure aggregation (DSA) from an information-theoretic perspective. Specifically, we consider a network of $K$ fully-connected users, each holding a private input -- an abstraction of local training data -- who aim to securely compute the sum of all inputs. The security constraint requires that no user learns anything beyond the input sum, even when colluding with up to $T$ other users. We characterize the optimal rate region, which specifies the minimum achievable communication and secret key rates for DSA. In particular, we show that to securely compute one symbol of the desired input sum, each user must (i) transmit at least one symbol to others, (ii) hold at least one symbol of secret key, and (iii) all users must collectively hold no fewer than $K - 1$ independent key symbols. Our results establish the fundamental performance limits of DSA, providing insights for the design of provably secure and communication-efficient protocols in distributed learning systems.
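The three bounds quoted above (one transmitted symbol per user, one key symbol per user, and $K-1$ independent key symbols in total) can be made concrete with a toy zero-sum masking scheme. This is an illustrative sketch only and is not claimed to be the authors' construction: it assumes a trusted dealer distributes the keys, works over a prime field whose modulus `P` is an arbitrary choice, and does not reproduce the collusion-resilience proof.

```python
import secrets
from typing import List

P = 2**61 - 1  # prime modulus; an arbitrary choice for this sketch


def zero_sum_keys(K: int, p: int = P) -> List[int]:
    """Draw K-1 independent uniform keys; the last key makes the total zero mod p."""
    keys = [secrets.randbelow(p) for _ in range(K - 1)]
    keys.append((-sum(keys)) % p)
    return keys


def broadcast(inputs: List[int], keys: List[int], p: int = P) -> List[int]:
    """Each user sends one masked symbol: its input plus its key, mod p."""
    return [(x + z) % p for x, z in zip(inputs, keys)]


def recover_sum(messages: List[int], p: int = P) -> int:
    """Any user adds all broadcast symbols; the keys cancel and the input sum remains."""
    return sum(messages) % p


if __name__ == "__main__":
    inputs = [3, 14, 15, 92]
    keys = zero_sum_keys(len(inputs))
    print(recover_sum(broadcast(inputs, keys)))  # equals sum(inputs) mod P
```

Each masked symbol on its own is uniformly distributed, while the sum of all broadcasts equals the sum of the inputs because the keys add to zero.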
CV · Dec 11, 2021
On Adversarial Robustness of Point Cloud Semantic Segmentation
Jiacen Xu, Zhe Zhou, Boyuan Feng et al.
Recent research efforts on 3D point cloud semantic segmentation (PCSS) have achieved outstanding performance by adopting neural networks. However, the robustness of these complex models has not been systematically analyzed. Given that PCSS has been applied in many safety-critical applications like autonomous driving, it is important to fill this knowledge gap, especially regarding how these models are affected by adversarial samples. As such, we present a comparative study of PCSS robustness. First, we formally define the attacker's objective under performance degradation and object hiding. Then, we develop new attacks that differ in whether the perturbation norm is bounded. We evaluate different attack options on two datasets and three PCSS models. We found that all the models are vulnerable and that attacking point color is more effective. With this study, we call the attention of the research community to develop new approaches to harden PCSS models.
CR · Mar 8, 2021
Volcano: Stateless Cache Side-channel Attack by Exploiting Mesh Interconnect
Junpeng Wan, Yanxiang Bi, Zhe Zhou et al.
Cache side-channel attacks pose severe security threats in settings where a CPU is shared across users, e.g., in the cloud. The existing attacks rely on sensing the micro-architectural state changes made by victims, and this assumption can be invalidated by combining spatial (e.g., Intel CAT) and temporal isolation (e.g., time protection). In this work, we advance the state of cache side-channel attacks by showing stateless cache side-channel attacks that cannot be defeated even by combining spatial and temporal isolation. This side channel exploits the timing difference resulting from interconnect congestion. Specifically, to complete cache transactions, for Intel CPUs, cache lines would travel across cores via the CPU mesh interconnect. Nonetheless, the mesh links are shared by all cores, and cache isolation does not segregate the traffic. An attacker can generate interconnect traffic to contend with the victim's on a mesh link, hoping that extra delay will be measured. With the varying delays, the attacker can deduce the memory access pattern of a victim program, and infer its sensitive data. Based on this idea, we implement Volcano and test it against the existing RSA implementations of JDK. We found the RSA private key used by a victim process can be partially recovered. In the end, we propose a few directions for defense and call for the attention of the security community.
CR · May 24, 2020
Continuous Release of Data Streams under both Centralized and Local Differential Privacy
Tianhao Wang, Joann Qiongna Chen, Zhikun Zhang et al.
In this paper, we study the problem of publishing a stream of real-valued data satisfying differential privacy (DP). One major challenge is that the maximal possible value can be quite large; thus it is necessary to estimate a threshold so that numbers above it are truncated to reduce the amount of noise that must be added to all the data. The estimation must be done based on the data in a private fashion. We develop such a method that uses the Exponential Mechanism with a quality function that approximates well the utility goal while maintaining a low sensitivity. Given the threshold, we then propose a novel online hierarchical method and several post-processing techniques. Building on these ideas, we formalize the steps into a framework for private publishing of stream data. Our framework consists of three components: a threshold optimizer that privately estimates the threshold, a perturber that adds calibrated noise to the stream, and a smoother that improves the result using post-processing. Within our framework, we design an algorithm satisfying the more stringent setting of DP called local DP (LDP). To our knowledge, this is the first LDP algorithm for publishing streaming data. Using four real-world datasets, we demonstrate that our mechanism outperforms the state-of-the-art by a factor of 6-10 orders of magnitude in terms of utility (measured by the mean squared error of answering a random range query).
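The role of the threshold can be seen in a much simpler baseline than the hierarchical method described above: truncate each value to $[0, \theta]$ and add Laplace noise with scale $\theta/\epsilon$. The sketch below is that baseline only, not the paper's perturber. It assumes non-negative stream values, a threshold `threshold` that has already been chosen privately, and a per-value (event-level) privacy guarantee; the function name `perturb_stream` is hypothetical.

```python
import random
from typing import Iterable, List, Optional


def perturb_stream(stream: Iterable[float], threshold: float, epsilon: float,
                   rng: Optional[random.Random] = None) -> List[float]:
    """Truncate each value to [0, threshold], then add Laplace(threshold/epsilon) noise.

    After truncation a single value can change by at most `threshold`,
    so the noise scale grows with the threshold rather than with the
    largest possible raw value.
    """
    rng = rng or random.Random()
    scale = threshold / epsilon
    released = []
    for value in stream:
        clipped = min(max(value, 0.0), threshold)
        # The difference of two i.i.d. exponentials with mean `scale`
        # is Laplace-distributed with scale `scale`.
        noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
        released.append(clipped + noise)
    return released


if __name__ == "__main__":
    print(perturb_stream([1.0, 50.0, 3.0], threshold=10.0, epsilon=1.0))
```

A smaller threshold lowers the noise added to every value but biases the large values that get truncated, which is the trade-off the threshold optimizer in the abstract is meant to balance.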
ITFeb 13, 2020
Conditional Disclosure of Secrets: A Noise and Signal Alignment ApproachZhou Li, Hua Sun
In the conditional disclosure of secrets (CDS) problem, Alice and Bob (each holding an input and a common secret) wish to disclose the secret to Carol, as efficiently as possible, if and only if their inputs satisfy some function. The capacity of CDS is the maximum number of bits of the secret that can be securely disclosed per bit of total communication. We characterize the necessary and sufficient condition for the extreme case where the capacity of CDS is the highest and is equal to 1/2. For the simplest instance where the capacity is smaller than 1/2, we show that the linear capacity is 2/5.
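To make the CDS setting concrete, the following Python sketch shows a classic textbook-style scheme for one specific condition, input equality, over a prime field. It is an illustration only and is not taken from the paper: the field size, the function names, and the choice of the equality condition are assumptions. Alice and Bob share two uniform noise symbols `r0` and `r1`; each sends one field symbol, so one secret symbol is disclosed per two communicated symbols (rate 1/2 in this toy instance).

```python
P = 101  # small prime field size, chosen for illustration only


def alice_message(x, r0, r1, p=P):
    """Alice masks her input with the shared noise (r0, r1)."""
    return (r0 * x + r1) % p


def bob_message(y, secret, r0, r1, p=P):
    """Bob adds the secret on top of the same noise pattern applied to y."""
    return (r0 * y + r1 + secret) % p


def carol_decode(a, b, p=P):
    """Carol subtracts the messages, obtaining secret + r0 * (y - x).
    If x == y the noise cancels and she recovers the secret; otherwise the
    term r0 * (y - x) is uniform over the field and hides the secret."""
    return (b - a) % p
```

When the inputs differ, the pair of messages is uniformly distributed regardless of the secret, because the map from `(r0, r1)` to `(a, b - a)` is a bijection on pairs of field elements.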
CRAug 31, 2019
Your Smart Home Can't Keep a Secret: Towards Automated Fingerprinting of IoT Traffic with Neural NetworksShuaike Dong, Zhou Li, Di Tang et al.
The IoT (Internet of Things) technology has been widely adopted in recent years and has profoundly changed people's daily lives. However, such a fast-growing technology has also introduced new privacy issues, which need to be better understood and measured. In this work, we look into how private information can be leaked from network traffic generated in the smart home network. Although researchers have proposed techniques to infer IoT device types or user behaviors under clean experimental setups, the effectiveness of such approaches becomes questionable in complex but realistic network environments, where common techniques like Network Address and Port Translation (NAPT) and Virtual Private Network (VPN) are enabled. Traffic analysis using traditional methods (e.g., through classical machine-learning models) is much less effective under those settings, as the manually picked features are no longer distinctive. In this work, we propose a traffic analysis framework based on sequence-learning techniques like LSTM that leverages the temporal relations between packets for the attack of device identification. We evaluated it under different environment settings (e.g., pure-IoT and noisy environments with multiple non-IoT devices). The results showed our framework was able to differentiate device types with high accuracy. This result suggests IoT network communications pose prominent challenges to users' privacy, even when they are protected by encryption and morphed by the network gateway. As such, new privacy protection methods for IoT traffic need to be developed to mitigate this new issue.
CRJan 5, 2018
Understanding Android Obfuscation Techniques: A Large-Scale Investigation in the WildShuaike Dong, Menghao Li, Wenrui Diao et al.
In this paper, we seek to better understand Android obfuscation and depict a holistic view of the usage of obfuscation through a large-scale investigation in the wild. In particular, we focus on four popular obfuscation approaches: identifier renaming, string encryption, Java reflection, and packing. To obtain meaningful statistical results, we designed efficient and lightweight detection models for each obfuscation technique and applied them to our massive APK datasets (collected from Google Play, multiple third-party markets, and malware databases). We have learned several interesting facts from the results. For example, malware authors use string encryption more frequently, and more apps on third-party markets than on Google Play are packed. We are also interested in the explanation of each finding. Therefore, we carry out in-depth code analysis on a sample of Android apps. We believe our study will help developers select the most suitable obfuscation approach and, at the same time, help researchers improve code analysis systems in the right direction.
CRMay 21, 2016
Vulnerable GPU Memory Management: Towards Recovering Raw Data from GPUZhe Zhou, Wenrui Diao, Xiangyu Liu et al.
In this paper, we show that the security threats that come with existing GPU memory management strategies are overlooked, which opens a back door for adversaries to freely break memory isolation: these strategies enable adversaries without any privilege on a computer to directly recover the raw memory data left by previous processes. More importantly, such attacks work not only on normal multi-user operating systems, but also on cloud computing platforms. To demonstrate the seriousness of such attacks, we recovered original data directly from GPU memory residues left by exited commodity applications, including Google Chrome, Adobe Reader, GIMP, and Matlab. The results show that, because of the vulnerable memory management strategy, the commodity applications in our experiments are all affected.
CRNov 18, 2014
Detection of Early-Stage Enterprise Infection by Mining Large-Scale Log DataAlina Oprea, Zhou Li, Ting-Fang Yen et al.
Recent years have seen the rise of more sophisticated attacks, including advanced persistent threats (APTs), which pose severe risks to organizations and governments by targeting confidential proprietary information. Additionally, new malware strains are appearing at a higher rate than ever before. Since many of these strains are designed to evade existing security products, traditional defenses deployed by most enterprises today, e.g., anti-virus, firewalls, and intrusion detection systems, often fail to detect infections at an early stage. We address the problem of detecting early-stage infection in an enterprise setting by proposing a new framework based on belief propagation, inspired by graph theory. Belief propagation can be used either with "seeds" of compromised hosts or malicious domains (provided by the enterprise security operation center, SOC) or without any seeds. In the latter case, we develop a detector of C&C communication particularly tailored to enterprises, which can detect a stealthy compromise of only a single host communicating with the C&C server. We demonstrate that our techniques perform well in detecting enterprise infections. We achieve high accuracy with low false detection and false negative rates on two months of anonymized DNS logs released by Los Alamos National Lab (LANL), which include APT infection attacks simulated by LANL domain experts. We also apply our algorithms to 38TB of real-world web proxy logs collected at the border of a large enterprise. Through careful manual investigation in collaboration with the enterprise SOC, we show that our techniques identified hundreds of malicious domains overlooked by state-of-the-art security products.
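The seeded mode described in this abstract can be sketched as an iterative propagation over a bipartite host-domain graph. The Python sketch below is a deliberately simplified label-propagation variant, not the paper's belief propagation algorithm: the scoring rule (fraction of a domain's contacting hosts that are already marked compromised), the threshold, and all names are assumptions made for illustration.

```python
from collections import defaultdict


def propagate(edges, seed_hosts, score_threshold=0.5, max_rounds=10):
    """Expand a set of suspected hosts and domains from seed hosts.

    edges: iterable of (host, domain) pairs, e.g. extracted from DNS or
           web proxy logs.
    seed_hosts: hosts already known to be compromised.
    Returns (suspected_hosts, suspected_domains).
    """
    hosts_of = defaultdict(set)    # domain -> hosts that contacted it
    domains_of = defaultdict(set)  # host -> domains it contacted
    for host, domain in edges:
        hosts_of[domain].add(host)
        domains_of[host].add(domain)

    bad_hosts = set(seed_hosts)
    bad_domains = set()
    for _ in range(max_rounds):
        # Candidate domains: contacted by a suspected host, not yet labeled.
        candidates = {d for h in bad_hosts for d in domains_of[h]} - bad_domains
        # Assumed score: share of the domain's visitors already suspected.
        scored = {d: len(hosts_of[d] & bad_hosts) / len(hosts_of[d])
                  for d in candidates}
        new_domains = {d for d, s in scored.items() if s >= score_threshold}
        if not new_domains:
            break  # nothing new was labeled, so the propagation has converged
        bad_domains |= new_domains
        for d in new_domains:
            bad_hosts |= hosts_of[d]  # hosts contacting a bad domain become suspects
    return bad_hosts, bad_domains
```

Under this assumed score, popular domains visited by many clean hosts stay below the threshold, while rare domains contacted mostly by suspected hosts are flagged.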
CRJul 21, 2014
An Empirical Study on Android for Saving Non-shared Data on Public StorageXiangyu Liu, Zhe Zhou, Wenrui Diao et al.
With millions of apps that can be downloaded from official or third-party markets, Android has become one of the most popular mobile platforms today. These apps help people in all kinds of ways and thus have access to large amounts of users' data, which in general fall into three categories: sensitive data, data to be shared with other apps, and non-sensitive data not to be shared with others. For the first and second types of data, Android has provided very good storage models: an app's private sensitive data are saved to its private folder, which can only be accessed by the app itself, and the data to be shared are saved to public storage (either the external SD card or the emulated SD card area on internal FLASH memory). But for the last type, i.e., an app's non-sensitive and non-shared data, there is a big problem in Android's current storage model, which essentially encourages an app to save its non-sensitive data to shared public storage that can be accessed by other apps. At first glance, this seems unproblematic, as those data are non-sensitive after all, but it implicitly assumes that app developers can correctly identify all sensitive data and prevent all possible information leakage from private-but-non-sensitive data. In this paper, we demonstrate that this is an invalid assumption with a thorough survey of information leaks in apps that followed Android's recommended storage model for non-sensitive data. Our studies showed that highly sensitive information from billions of users can be easily obtained by exploiting the problematic storage model. Although our empirical studies are based on a limited set of apps, the identified problems are by no means isolated or accidental bugs of the apps being investigated. On the contrary, the problem is rooted in the vulnerable storage model recommended by Android. To mitigate the threat, we also propose a defense framework.