CRMay 20Code
Auditing Apple's DifferentialPrivacy.framework: Implementation Bugs, Misconfigurations, and Practical RisksRishav Chourasia, Ergute Bao, Uzair Javaid et al.
Since 2016, Apple has claimed that device analytics collected to improve user experience are protected by differential privacy (DP). Apple's DifferentialPrivacy.framework is deployed across its operating systems and handles sensitive signals such as Safari domains, keyboard events, photo attributes, and health-related reports. Because Apple has not open-sourced its privatization algorithms, these privacy claims have been difficult to verify independently. We present a client-side audit of Apple's DP framework on macOS Sonoma 14.2 and Sequoia 15.6. We reverse engineer the shipped binaries, recover Objective-C interfaces, build runtime harnesses that execute Apple's deployed mechanisms, and test whether their outputs match the advertised privacy guarantees. Our audit covers nearly all active deployed mechanisms, including Count Median Sketch, Hadamard-CMS, randomized-response mechanisms, and Prio-style secure aggregation. We find multiple implementation bugs and misconfigurations. Every audited mechanism that relies on floating-point noise fails to meet its advertised DP or zero-knowledge proof guarantee, due to insecure samplers with known floating-point vulnerabilities. We also find secure-aggregation configurations with local DP disabled, exposing pre-aggregation records to any party with access to those logs. Overall, we find DP violations in 5 of 9 audited mechanisms, affecting 87% of data collection in macOS Sonoma and 68% in Sequoia. We also identify public leaked iPhone logs that can be decoded to recover private information, including Safari domains and keyboard emoji signals.
LGSep 25, 2024
CombU: A Combined Unit Activation for Fitting Mathematical Expressions with Neural NetworksJiayu Li, Zilong Zhao, Kevin Yee et al.
The activation functions are fundamental to neural networks as they introduce non-linearity into data relationships, thereby enabling deep networks to approximate complex data relations. Existing efforts to enhance neural network performance have predominantly focused on developing new mathematical functions. However, we find that a well-designed combination of existing activation functions within a neural network can also achieve this objective. In this paper, we introduce the Combined Units activation (CombU), which employs different activation functions at various dimensions across different layers. This approach can be theoretically proven to fit most mathematical expressions accurately. The experiments conducted on four mathematical expression datasets, compared against six State-Of-The-Art (SOTA) activation function algorithms, demonstrate that CombU outperforms all SOTA algorithms in 10 out of 16 metrics and ranks in the top three for the remaining six metrics.
CVNov 28, 2025Code
Instruction Tuning of Large Language Models for Tabular Data Generation-in One DayMilad Abdollahzadeh, Abdul Raheem, Zilong Zhao et al.
Tabular instruction tuning has emerged as a promising research direction for improving LLMs understanding of tabular data. However, the majority of existing works only consider question-answering and reasoning tasks over tabular data, leaving tabular data generation largely unnoticed. In this work, for the first time, we explore the efficacy of instruction tuning in improving LLMs tabular data generation capabilities. More specifically, given the high data and computation requirements of tabular instruction tuning, we aim to address the possibility of instruction tuning for tabular data generation with limited data and computational resources. To achieve this, we first create a high-quality instruction dataset for tabular data, enabling efficient LLM comprehension. We then instruction-tune an open-source LLM (Llama3.1-8B-Instruct) on the training set of this dataset to improve its tabular data generation performance. Our experimental results show that by using our high-quality dataset and instruction-tuning on only 7K instructions with an A100 GPU, for less than 6 hours, we achieve tabular data generation performance on par with the most capable commercial LLM, GPT-4o.
LGJan 2, 2025
TabTreeFormer: Tabular Data Generation Using Hybrid Tree-TransformerJiayu Li, Bingyin Zhao, Zilong Zhao et al.
Transformers have shown impressive results in tabular data generation. However, they lack domain-specific inductive biases which are critical for preserving the intrinsic characteristics of tabular data. They also suffer from poor scalability and efficiency due to quadratic computational complexity. In this paper, we propose TabTreeFormer, a hybrid transformer architecture that integrates inductive biases of tree-based models (i.e., non-smoothness and non-rotational invariance) to effectively handle the discrete and weakly correlated features in tabular datasets. To improve numerical fidelity and capture multimodal distributions, we introduce a novel tokenizer that learns token sequences based on the complexity of tabular values. This reduces vocabulary size and sequence length, yielding more compact and efficient representations without sacrificing performance. We evaluate TabTreeFormer on nine diverse datasets, benchmarking against eight generative models. We show that TabTreeFormer consistently outperforms baselines in utility, fidelity, and privacy metrics with competitive efficiency. Notably, in scenarios prioritizing data utility over privacy and efficiency, the best variant of TabTreeFormer delivers a 44% performance gain relative to its baseline variant.
LGNov 14, 2024
Laplace Transform Interpretation of Differential PrivacyRishav Chourasia, Uzair Javaid, Biplap Sikdar
We introduce a set of useful expressions of Differential Privacy (DP) notions in terms of the Laplace transform of the privacy loss distribution. Its bare form expression appears in several related works on analyzing DP, either as an integral or an expectation. We show that recognizing the expression as a Laplace transform unlocks a new way to reason about DP properties by exploiting the duality between time and frequency domains. Leveraging our interpretation, we connect the $(q, ρ(q))$-Rényi DP curve and the $(ε, δ(ε))$-DP curve as being the Laplace and inverse-Laplace transforms of one another. This connection shows that the Rényi divergence is well-defined for complex orders $q = γ+ i ω$. Using our Laplace transform-based analysis, we also prove an adaptive composition theorem for $(ε, δ)$-DP guarantees that is exactly tight (i.e., matches even in constants) for all values of $ε$. Additionally, we resolve an issue regarding symmetry of $f$-DP on subsampling that prevented equivalence across all functional DP notions.