Josep M. Pujol

LGJul 8, 2019

Privacy-Preserving Classification with Secret Vector Machines

Valentin Hartmann, Konark Modi, Josep M. Pujol et al.

Today, large amounts of valuable data are distributed among millions of user-held devices, such as personal computers, phones, or Internet-of-things devices. Many companies collect such data with the goal of using it for training machine learning models allowing them to improve their services. User-held data is, however, often sensitive, and collecting it is problematic in terms of privacy. We address this issue by proposing a novel way of training a supervised classifier in a distributed setting akin to the recently proposed federated learning paradigm, but under the stricter privacy requirement that the server that trains the model is assumed to be untrusted and potentially malicious. We thus preserve user privacy by design, rather than by trust. In particular, our framework, called secret vector machine (SecVM), provides an algorithm for training linear support vector machines (SVM) in a setting in which data-holding clients communicate with an untrusted server by exchanging messages designed to not reveal any personally identifiable information. We evaluate our model in two ways. First, in an offline evaluation, we train SecVM to predict user gender from tweets, showing that we can preserve user privacy without sacrificing classification performance. Second, we implement SecVM's distributed framework for the Cliqz web browser and deploy it for predicting user gender in a large-scale online evaluation with thousands of clients, outperforming baselines by a large margin and thus showcasing that SecVM is suitable for production environments.

CRDec 19, 2018

Preventing Attacks on Anonymous Data Collection

Alex Catarineu, Philipp Claßen, Konark Modi et al.

Anonymous data collection systems allow users to contribute the data necessary to build services and applications while preserving their privacy. Anonymity, however, can be abused by malicious agents aiming to subvert or to sabotage the data collection, for instance by injecting fabricated data. In this paper we propose an efficient mechanism to rate-limit an attacker without compromising the privacy and anonymity of the users contributing data. The proposed system builds on top of Direct Anonymous Attestation, a proven cryptographic primitive. We describe how a set of rate-limiting rules can be formalized to define a normative space in which messages sent by an attacker can be linked, and consequently, dropped. We present all components needed to build and deploy such protection on existing data collection systems with little overhead. Empirical evaluation yields performance up to 125 and 140 messages per second for senders and the collector respectively on nominal hardware. Latency of communication is bound to 4 seconds in the 95th percentile when using Tor as network layer.

Josep M. Pujol

2 Papers