LGOct 4, 2022Code
OpBoost: A Vertical Federated Tree Boosting Framework Based on Order-Preserving DesensitizationXiaochen Li, Yuke Hu, Weiran Liu et al.
Vertical Federated Learning (FL) is a new paradigm that enables users with non-overlapping attributes of the same data samples to jointly train a model without directly sharing the raw data. Nevertheless, recent works show that it's still not sufficient to prevent privacy leakage from the training process or the trained model. This paper focuses on studying the privacy-preserving tree boosting algorithms under the vertical FL. The existing solutions based on cryptography involve heavy computation and communication overhead and are vulnerable to inference attacks. Although the solution based on Local Differential Privacy (LDP) addresses the above problems, it leads to the low accuracy of the trained model. This paper explores to improve the accuracy of the widely deployed tree boosting algorithms satisfying differential privacy under vertical FL. Specifically, we introduce a framework called OpBoost. Three order-preserving desensitization algorithms satisfying a variant of LDP called distance-based LDP (dLDP) are designed to desensitize the training data. In particular, we optimize the dLDP definition and study efficient sampling distributions to further improve the accuracy and efficiency of the proposed algorithms. The proposed algorithms provide a trade-off between the privacy of pairs with large distance and the utility of desensitized values. Comprehensive evaluations show that OpBoost has a better performance on prediction accuracy of trained models compared with existing LDP approaches on reasonable settings. Our code is open source.
GTNov 11, 2025
Reliable and Private Utility Signaling for Data MarketsLi Peng, Jiayao Zhang, Yihang Wu et al.
The explosive growth of data has highlighted its critical role in driving economic growth through data marketplaces, which enable extensive data sharing and access to high-quality datasets. To support effective trading, signaling mechanisms provide participants with information about data products before transactions, enabling informed decisions and facilitating trading. However, due to the inherent free-duplication nature of data, commonly practiced signaling methods face a dilemma between privacy and reliability, undermining the effectiveness of signals in guiding decision-making. To address this, this paper explores the benefits and develops a non-TCP-based construction for a desirable signaling mechanism that simultaneously ensures privacy and reliability. We begin by formally defining the desirable utility signaling mechanism and proving its ability to prevent suboptimal decisions for both participants and facilitate informed data trading. To design a protocol to realize its functionality, we propose leveraging maliciously secure multi-party computation (MPC) to ensure the privacy and robustness of signal computation and introduce an MPC-based hash verification scheme to ensure input reliability. In multi-seller scenarios requiring fair data valuation, we further explore the design and optimization of the MPC-based KNN-Shapley method with improved efficiency. Rigorous experiments demonstrate the efficiency and practicality of our approach.
IRApr 27
Versioned Late Materialization for Ultra-Long Sequence Training in Recommendation Systems at ScaleLiang Guo, Ge Song, Litao Deng et al.
Modern Deep Learning Recommendation Models (DLRMs) follow scaling laws with sequence length, driving the frontier toward ultra-long User Interaction History (UIH). However, the industry-standard "Fat Row" paradigm, which pre-materializes these sequences into every training example, creates a storage and I/O wall where data infrastructure usage exceeds GPU training capacity due to data redundancy that is amplified in multi-tenant environments where models with vastly different sequence length requirements share a union dataset. We present a \emph{versioned late materialization} paradigm that eliminates this redundancy by storing UIH once in a normalized, immutable tier and reconstructing sequences just-in-time during training via lightweight versioned pointers. The system ensures Online-to-Offline (O2O) consistency through a bifurcated protocol that prevents future leakage across both streaming and batch training, while a read-optimized immutable storage layer provides multi-dimensional projection pushdown for heterogeneous model tenants. Disaggregated data preprocessing with pipelined I/O prefetching and data-affinity optimizations masks the latency of training-time sequence reconstruction, keeping training throughput compute-bound by GPUs. Deployed on production DLRMs, the system reduces training data infrastructure resource usage while enabling aggressive sequence length scaling that delivers significant model quality gains, serving as the foundational data infrastructure for modern recommendation model architectures, including HSTU and ULTRA-HSTU.
CRAug 26, 2025
SecureV2X: An Efficient and Privacy-Preserving System for Vehicle-to-Everything (V2X) ApplicationsJoshua Lee, Ali Arastehfard, Weiran Liu et al.
Autonomous driving and V2X technologies have developed rapidly in the past decade, leading to improved safety and efficiency in modern transportation. These systems interact with extensive networks of vehicles, roadside infrastructure, and cloud resources to support their machine learning capabilities. However, the widespread use of machine learning in V2X systems raises issues over the privacy of the data involved. This is particularly concerning for smart-transit and driver safety applications which can implicitly reveal user locations or explicitly disclose medical data such as EEG signals. To resolve these issues, we propose SecureV2X, a scalable, multi-agent system for secure neural network inferences deployed between the server and each vehicle. Under this setting, we study two multi-agent V2X applications: secure drowsiness detection, and secure red-light violation detection. Our system achieves strong performance relative to baselines, and scales efficiently to support a large number of secure computation interactions simultaneously. For instance, SecureV2X is $9.4 \times$ faster, requires $143\times$ fewer computational rounds, and involves $16.6\times$ less communication on drowsiness detection compared to other secure systems. Moreover, it achieves a runtime nearly $100\times$ faster than state-of-the-art benchmarks in object detection tasks for red light violation detection.
CRJun 16, 2025
Rectifying Privacy and Efficacy Measurements in Machine Unlearning: A New Inference Attack PerspectiveNima Naderloui, Shenao Yan, Binghui Wang et al.
Machine unlearning focuses on efficiently removing specific data from trained models, addressing privacy and compliance concerns with reasonable costs. Although exact unlearning ensures complete data removal equivalent to retraining, it is impractical for large-scale models, leading to growing interest in inexact unlearning methods. However, the lack of formal guarantees in these methods necessitates the need for robust evaluation frameworks to assess their privacy and effectiveness. In this work, we first identify several key pitfalls of the existing unlearning evaluation frameworks, e.g., focusing on average-case evaluation or targeting random samples for evaluation, incomplete comparisons with the retraining baseline. Then, we propose RULI (Rectified Unlearning Evaluation Framework via Likelihood Inference), a novel framework to address critical gaps in the evaluation of inexact unlearning methods. RULI introduces a dual-objective attack to measure both unlearning efficacy and privacy risks at a per-sample granularity. Our findings reveal significant vulnerabilities in state-of-the-art unlearning methods, where RULI achieves higher attack success rates, exposing privacy risks underestimated by existing methods. Built on a game-based foundation and validated through empirical evaluations on both image and text data (spanning tasks from classification to generation), RULI provides a rigorous, scalable, and fine-grained methodology for evaluating unlearning techniques.
CRSep 29, 2020
Privacy Enhancement via Dummy Points in the Shuffle ModelXiaochen Li, Weiran Liu, Hanwen Feng et al.
The shuffle model is recently proposed to address the issue of severe utility loss in Local Differential Privacy (LDP) due to distributed data randomization.In the shuffle model, a shuffler is utilized to break the link between the user identity and the message uploaded to the data analyst. Since less noise needs to be introduced to achieve the same privacy guarantee, following this paradigm, the utility of privacy-preserving data collection is improved. We propose DUMP (\underline{DUM}my-\underline{P}oint-based), a framework for privacy-preserving histogram estimation in the shuffle model. The core of DUMP is a new concept of \emph{dummy blanket}, which enables enhancing privacy by just introducing \textit{points }on the user side and further improving the utility of the shuffle model.We instantiate DUMP by proposing two protocols: pureDUMP and mixDUMP, and conduct a comprehensive experimental evaluation to compare them with existing protocols. The experimental results show that, under the same privacy guarantee, (1) the proposed protocols have significant improvements in communication efficiency over all existing multi-message protocols, by at least 3 orders of magnitude; (2) they achieve competitive utility, while the only known protocol (Ghazi \textit{et al.}, PMLR 2020) having better utility than ours employs hard-to-exactly-sample distributions which are vulnerable to floating-point attacks (CCS 2012).
CRSep 11, 2020
Strengthening Order Preserving Encryption with Differential PrivacyAmrita Roy Chowdhury, Bolin Ding, Somesh Jha et al.
Ciphertexts of an order-preserving encryption (OPE) scheme preserve the order of their corresponding plaintexts. However, OPEs are vulnerable to inference attacks that exploit this preserved order. At another end, differential privacy has become the de-facto standard for achieving data privacy. One of the most attractive properties of DP is that any post-processing (inferential) computation performed on the noisy output of a DP algorithm does not degrade its privacy guarantee. In this paper, we propose a novel differentially private order preserving encryption scheme, OP$ε$. Under OP$ε$, the leakage of order from the ciphertexts is differentially private. As a result, in the least, OP$ε$ ensures a formal guarantee (specifically, a relaxed DP guarantee) even in the face of inference attacks. To the best of our knowledge, this is the first work to combine DP with a property-preserving encryption scheme. We demonstrate OP$ε$'s practical utility in answering range queries via extensive empirical evaluation on four real-world datasets. For instance, OP$ε$ misses only around $4$ in every $10K$ correct records on average for a dataset of size $\sim732K$ with an attribute of domain size $\sim18K$ and $ε= 1$.