CRJul 6, 2023
DPM: Clustering Sensitive Data through SeparationJohannes Liebenow, Yara Schütt, Tanya Braun et al.
Clustering is an important tool for data exploration where the goal is to subdivide a data set into disjoint clusters that fit well into the underlying data structure. When dealing with sensitive data, privacy-preserving algorithms aim to approximate the non-private baseline while minimising the leakage of sensitive information. State-of-the-art privacy-preserving clustering algorithms tend to output clusters that are good in terms of the standard metrics, inertia, silhouette score, and clustering accuracy, however, the clustering result strongly deviates from the non-private KMeans baseline. In this work, we present a privacy-preserving clustering algorithm called DPM that recursively separates a data set into clusters based on a geometrical clustering approach. In addition, DPM estimates most of the data-dependent hyper-parameters in a privacy-preserving way. We prove that DPM preserves Differential Privacy and analyse the utility guarantees of DPM. Finally, we conduct an extensive empirical evaluation for synthetic and real-life data sets. We show that DPM achieves state-of-the-art utility on the standard clustering metrics and yields a clustering result much closer to that of the popular non-private KMeans algorithm without requiring the number of classes.
NEApr 10
Beyond Silicon: Materials, Mechanisms, and Methods for Physical Neural ComputingStefan Fischer, Nihat Ay, Olaf Landsiedel et al.
Physical implementations of neural computation now extend far beyond silicon hardware, encompassing substrates such as memristive devices, photonic circuits, mechanical metamaterials, microfluidic networks, chemical reaction systems, and living neural tissue. By exploiting intrinsic physical processes such as charge transport, wave interference, elastic deformation, mass transport, and biochemical regulation, these substrates can realize neural inference and adaptation directly in matter. As silicon GPU-centered AI faces growing energy and data-movement constraints, physical neural computation is becoming increasingly relevant as a complementary path beyond conventional digital accelerators. This trend is driven in particular by pervasive intelligence, i.e., the deployment of on-device and edge AI across large numbers of resource-constrained systems. In such settings, co-locating computation with sensing and memory can reduce data shuttling and improve efficiency. Meanwhile, physical neural approaches have emerged across disparate disciplines, yet progress remains fragmented, with limited shared terminology and few principled ways to compare platforms. This survey unifies the field by mapping neural primitives to substrate-specific mechanisms, analyzing architectural and training paradigms, and identifying key engineering constraints including scalability, precision, programmability, and I/O interfacing overhead. To enable cross-domain comparison, we introduce a first-order benchmarking scheme based on standardized static and dynamic tasks and physically interpretable performance dimensions. We show that no single substrate dominates across the considered dimensions; instead, physical neural systems occupy complementary operating regimes, enabling applications ranging from ultrafast signal processing and in-memory inference to embodied control and in-sample biochemical decision making.
CRNov 3, 2022
Private Blind Model Averaging - Distributed, Non-interactive, and ConvergentMoritz Kirschte, Sebastian Meiser, Saman Ardalan et al.
Distributed differentially private learning techniques enable a large number of users to jointly learn a model without having to first centrally collect the training data. At the same time, neither the communication between the users nor the resulting model shall leak information about the training data. This kind of learning technique can be deployed to edge devices if it can be scaled up to a large number of users, particularly if the communication is reduced to a minimum: no interaction, i.e., each party only sends a single message. The best previously known methods are based on gradient averaging, which inherently requires many synchronization rounds. A promising non-interactive alternative to gradient averaging relies on so-called output perturbation: each user first locally finishes training and then submits its model for secure averaging without further synchronization. We analyze this paradigm, which we coin blind model averaging (BlindAvg), in the setting of convex and smooth empirical risk minimization (ERM) like a support vector machine (SVM). While the required noise scale is asymptotically the same as in the centralized setting, it is not well understood how close BlindAvg comes to centralized learning, i.e., its utility cost. We characterize and boost the privacy-utility tradeoff of BlindAvg with two contributions: First, we prove that BlindAvg converges towards the centralized setting for a sufficiently strong L2-regularization for a non-smooth SVM learner. Second, we introduce the novel differentially private convex and smooth ERM learner SoftmaxReg that has a better privacy-utility tradeoff than an SVM in a multi-class setting. We evaluate our findings on three datasets (CIFAR-10, CIFAR-100, and Federated EMNIST) and provide an ablation in an artificially extreme non-IID scenario.
CRSep 21, 2023
S-BDT: Distributed Differentially Private Boosted Decision TreesThorsten Peinemann, Moritz Kirschte, Joshua Stock et al.
We introduce S-BDT: a novel $(\varepsilon,δ)$-differentially private distributed gradient boosted decision tree (GBDT) learner that improves the protection of single training data points (privacy) while achieving meaningful learning goals, such as accuracy or regression error (utility). S-BDT uses less noise by relying on non-spherical multivariate Gaussian noise, for which we show tight subsampling bounds for privacy amplification and incorporate that into a Rényi filter for individual privacy accounting. We experimentally reach the same utility while saving $50\%$ in terms of epsilon for $\varepsilon \le 0.5$ on the Abalone regression dataset (dataset size $\approx 4K$), saving $30\%$ in terms of epsilon for $\varepsilon \le 0.08$ for the Adult classification dataset (dataset size $\approx 50K$), and saving $30\%$ in terms of epsilon for $\varepsilon\leq0.03$ for the Spambase classification dataset (dataset size $\approx 5K$). Moreover, we show that for situations where a GBDT is learning a stream of data that originates from different subpopulations (non-IID), S-BDT improves the saving of epsilon even further.
LGOct 6, 2025
DP-HYPE: Distributed Differentially Private Hyperparameter SearchJohannes Liebenow, Thorsten Peinemann, Esfandiar Mohammadi
The tuning of hyperparameters in distributed machine learning can substantially impact model performance. When the hyperparameters are tuned on sensitive data, privacy becomes an important challenge and to this end, differential privacy has emerged as the de facto standard for provable privacy. A standard setting when performing distributed learning tasks is that clients agree on a shared setup, i.e., find a compromise from a set of hyperparameters, like the learning rate of the model to be trained. Yet, prior work on differentially private hyperparameter tuning either uses computationally expensive cryptographic protocols, determines hyperparameters separately for each client, or applies differential privacy locally, which can lead to undesirable utility-privacy trade-offs. In this work, we present our algorithm DP-HYPE, which performs a distributed and privacy-preserving hyperparameter search by conducting a distributed voting based on local hyperparameter evaluations of clients. In this way, DP-HYPE selects hyperparameters that lead to a compromise supported by the majority of clients, while maintaining scalability and independence from specific learning tasks. We prove that DP-HYPE preserves the strong notion of differential privacy called client-level differential privacy and, importantly, show that its privacy guarantees do not depend on the number of hyperparameters. We also provide bounds on its utility guarantees, that is, the probability of reaching a compromise, and implement DP-HYPE as a submodule in the popular Flower framework for distributed machine learning. In addition, we evaluate performance on multiple benchmark data sets in iid as well as multiple non-iid settings and demonstrate high utility of DP-HYPE even under small privacy budgets.
LGAug 7, 2025
Non-omniscient backdoor injection with a single poison sample: Proving the one-poison hypothesis for linear regression and linear classificationThorsten Peinemann, Paula Arnold, Sebastian Berndt et al.
Backdoor injection attacks are a threat to machine learning models that are trained on large data collected from untrusted sources; these attacks enable attackers to inject malicious behavior into the model that can be triggered by specially crafted inputs. Prior work has established bounds on the success of backdoor attacks and their impact on the benign learning task, however, an open question is what amount of poison data is needed for a successful backdoor attack. Typical attacks either use few samples, but need much information about the data points or need to poison many data points. In this paper, we formulate the one-poison hypothesis: An adversary with one poison sample and limited background knowledge can inject a backdoor with zero backdooring-error and without significantly impacting the benign learning task performance. Moreover, we prove the one-poison hypothesis for linear regression and linear classification. For adversaries that utilize a direction that is unused by the benign data distribution for the poison sample, we show that the resulting model is functionally equivalent to a model where the poison was excluded from training. We build on prior work on statistical backdoor learning to show that in all other cases, the impact on the benign learning task is still limited. We also validate our theoretical results experimentally with realistic benchmark data sets.
CRJul 27, 2021
Learning Numeric Optimal Differentially Private Truncated Additive MechanismsDavid M. Sommer, Lukas Abfalterer, Sheila Zingg et al.
Differentially private (DP) mechanisms face the challenge of providing accurate results while protecting their inputs: the privacy-utility trade-off. A simple but powerful technique for DP adds noise to sensitivity-bounded query outputs to blur the exact query output: additive mechanisms. While a vast body of work considers infinitely wide noise distributions, some applications (e.g., real-time operating systems) require hard bounds on the deviations from the real query, and only limited work on such mechanisms exist. An additive mechanism with truncated noise (i.e., with bounded range) can offer such hard bounds. We introduce a gradient-descent-based tool to learn truncated noise for additive mechanisms with strong utility bounds while simultaneously optimizing for differential privacy under sequential composition, i.e., scenarios where multiple noisy queries on the same data are revealed. Our method can learn discrete noise patterns and not only hyper-parameters of a predefined probability distribution. For sensitivity bounded mechanisms, we show that it is sufficient to consider symmetric and that\new{, for from the mean monotonically falling noise,} ensuring privacy for a pair of representative query outputs guarantees privacy for all pairs of inputs (that differ in one element). We find that the utility-privacy trade-off curves of our generated noise are remarkably close to truncated Gaussians and even replicate their shape for $l_2$ utility-loss. For a low number of compositions, we also improved DP-SGD (sub-sampling). Moreover, we extend Moments Accountant to truncated distributions, allowing to incorporate mechanism output events with varying input-dependent zero occurrence probability.
LGJul 20, 2021
Responsible and Regulatory Conform Machine Learning for Medicine: A Survey of Challenges and SolutionsEike Petersen, Yannik Potdevin, Esfandiar Mohammadi et al.
Machine learning is expected to fuel significant improvements in medical care. To ensure that fundamental principles such as beneficence, respect for human autonomy, prevention of harm, justice, privacy, and transparency are respected, medical machine learning systems must be developed responsibly. Many high-level declarations of ethical principles have been put forth for this purpose, but there is a severe lack of technical guidelines explicating the practical consequences for medical machine learning. Similarly, there is currently considerable uncertainty regarding the exact regulatory requirements placed upon medical machine learning systems. This survey provides an overview of the technical and procedural challenges involved in creating medical machine learning systems responsibly and in conformity with existing regulations, as well as possible solutions to address these challenges. First, a brief review of existing regulations affecting medical machine learning is provided, showing that properties such as safety, robustness, reliability, privacy, security, transparency, explainability, and nondiscrimination are all demanded already by existing law and regulations - albeit, in many cases, to an uncertain degree. Next, the key technical obstacles to achieving these desirable properties are discussed, as well as important techniques to overcome these obstacles in the medical context. We notice that distribution shift, spurious correlations, model underspecification, uncertainty quantification, and data scarcity represent severe challenges in the medical context. Promising solution approaches include the use of large and representative datasets and federated learning as a means to that end, the careful exploitation of domain knowledge, the use of inherently transparent models, comprehensive out-of-distribution model testing and verification, as well as algorithmic impact assessments.
CRMay 2, 2019
Differential privacy with partial knowledgeDamien Desfontaines, Esfandiar Mohammadi, Elisabeth Krahmer et al.
Differential privacy offers formal quantitative guarantees for algorithms over datasets, but it assumes attackers that know and can influence all but one record in the database. This assumption often vastly overapproximates the attackers' actual strength, resulting in unnecessarily poor utility. Recent work has made significant steps towards privacy in the presence of partial background knowledge, which can model a realistic attacker's uncertainty. Prior work, however, has definitional problems for correlated data and does not precisely characterize the underlying attacker model. We propose a practical criterion to prevent problems due to correlations, and we show how to characterize attackers with limited influence or only partial background knowledge over the dataset. We use these foundations to analyze practical scenarios: we significantly improve known results about the privacy of counting queries under partial knowledge, and we show that thresholding can provide formal guarantees against such weak attackers, even with little entropy in the data. These results allow us to draw novel links between k-anonymity and differential privacy under partial knowledge. Finally, we prove composition results on differential privacy with partial knowledge, which quantifies the privacy leakage of complex mechanisms. Our work provides a basis for formally quantifying the privacy of many widely-used mechanisms, e.g. publishing the result of surveys, elections or referendums, and releasing usage statistics of online services.
CRAug 15, 2016
Computational Soundness for Dalvik BytecodeMichael Backes, Robert Künnemann, Esfandiar Mohammadi
Automatically analyzing information flow within Android applications that rely on cryptographic operations with their computational security guarantees imposes formidable challenges that existing approaches for understanding an app's behavior struggle to meet. These approaches do not distinguish cryptographic and non-cryptographic operations, and hence do not account for cryptographic protections: f(m) is considered sensitive for a sensitive message m irrespective of potential secrecy properties offered by a cryptographic operation f. These approaches consequently provide a safe approximation of the app's behavior, but they mistakenly classify a large fraction of apps as potentially insecure and consequently yield overly pessimistic results. In this paper, we show how cryptographic operations can be faithfully included into existing approaches for automated app analysis. To this end, we first show how cryptographic operations can be expressed as symbolic abstractions within the comprehensive Dalvik bytecode language. These abstractions are accessible to automated analysis, and they can be conveniently added to existing app analysis tools using minor changes in their semantics. Second, we show that our abstractions are faithful by providing the first computational soundness result for Dalvik bytecode, i.e., the absence of attacks against our symbolically abstracted program entails the absence of any attacks against a suitable cryptographic program realization. We cast our computational soundness result in the CoSP framework, which makes the result modular and composable.