Sonia Ben Mokhtar

h-index29

19papers

299citations

Novelty51%

AI Score46

Ranked #61,713 of 201,326 authors (top 31%)#14,024 in LG (top 33%)

19 Papers

IRAug 9, 2022

PEPPER: Empowering User-Centric Recommender Systems over Gossip Learning

Yacine Belal, Aurélien Bellet, Sonia Ben Mokhtar et al.

Recommender systems are proving to be an invaluable tool for extracting user-relevant content helping users in their daily activities (e.g., finding relevant places to visit, content to consume, items to purchase). However, to be effective, these systems need to collect and analyze large volumes of personal data (e.g., location check-ins, movie ratings, click rates .. etc.), which exposes users to numerous privacy threats. In this context, recommender systems based on Federated Learning (FL) appear to be a promising solution for enforcing privacy as they compute accurate recommendations while keeping personal data on the users' devices. However, FL, and therefore FL-based recommender systems, rely on a central server that can experience scalability issues besides being vulnerable to attacks. To remedy this, we propose PEPPER, a decentralized recommender system based on gossip learning principles. In PEPPER, users gossip model updates and aggregate them asynchronously. At the heart of PEPPER reside two key components: a personalized peer-sampling protocol that keeps in the neighborhood of each node, a proportion of nodes that have similar interests to the former and a simple yet effective model aggregation function that builds a model that is better suited to each user. Through experiments on three real datasets implementing two use cases: a location check-in recommendation and a movie recommendation, we demonstrate that our solution converges up to 42% faster than with other decentralized solutions providing up to 9% improvement on average performance metric such as hit ratio and up to 21% improvement on long tail performance compared to decentralized competitors.

CRAug 11, 2022

Shielding Federated Learning Systems against Inference Attacks with ARM TrustZone

Aghiles Ait Messaoud, Sonia Ben Mokhtar, Vlad Nitu et al.

Federated Learning (FL) opens new perspectives for training machine learning models while keeping personal data on the users premises. Specifically, in FL, models are trained on the users devices and only model updates (i.e., gradients) are sent to a central server for aggregation purposes. However, the long list of inference attacks that leak private data from gradients, published in the recent years, have emphasized the need of devising effective protection mechanisms to incentivize the adoption of FL at scale. While there exist solutions to mitigate these attacks on the server side, little has been done to protect users from attacks performed on the client side. In this context, the use of Trusted Execution Environments (TEEs) on the client side are among the most proposing solutions. However, existing frameworks (e.g., DarkneTZ) require statically putting a large portion of the machine learning model into the TEE to effectively protect against complex attacks or a combination of attacks. We present GradSec, a solution that allows protecting in a TEE only sensitive layers of a machine learning model, either statically or dynamically, hence reducing both the TCB size and the overall training time by up to 30% and 56%, respectively compared to state-of-the-art competitors.

IRJun 15, 2023

Inferring Communities of Interest in Collaborative Learning-based Recommender Systems

Yacine Belal, Sonia Ben Mokhtar, Mohamed Maouche et al.

Collaborative-learning-based recommender systems, such as those employing Federated Learning (FL) and Gossip Learning (GL), allow users to train models while keeping their history of liked items on their devices. While these methods were seen as promising for enhancing privacy, recent research has shown that collaborative learning can be vulnerable to various privacy attacks. In this paper, we propose a novel attack called Community Inference Attack (CIA), which enables an adversary to identify community members based on a set of target items. What sets CIA apart is its efficiency: it operates at low computational cost by eliminating the need for training surrogate models. Instead, it uses a comparison-based approach, inferring sensitive information by comparing users' models rather than targeting any specific individual model. To evaluate the effectiveness of CIA, we conduct experiments on three real-world recommendation datasets using two recommendation models under both Federated and Gossip-like settings. The results demonstrate that CIA can be up to 10 times more accurate than random guessing. Additionally, we evaluate two mitigation strategies: Differentially Private Stochastic Gradient Descent (DP-SGD) and a Share less policy, which involves sharing fewer, less sensitive model parameters. Our findings suggest that the Share less strategy offers a better privacy-utility trade-off, especially in GL.

LGNov 4, 2024

Differentially private and decentralized randomized power method

Julien Nicolas, César Sabater, Mohamed Maouche et al.

The randomized power method has gained significant interest due to its simplicity and efficient handling of large-scale spectral analysis and recommendation tasks. However, its application to large datasets containing personal information (e.g., web interactions, search history, personal tastes) raises critical privacy problems. This paper addresses these issues by proposing enhanced privacy-preserving variants of the method. First, we propose a variant that reduces the amount of the noise required in current techniques to achieve Differential Privacy (DP). More precisely, we refine the privacy analysis so that the Gaussian noise variance no longer grows linearly with the target rank, achieving the same DP guarantees with strictly less noise. Second, we adapt our method to a decentralized framework in which data is distributed among multiple users. The decentralized protocol strengthens privacy guarantees with no accuracy penalty and a low computational and communication overhead. Our results include the provision of tighter convergence bounds for both the centralized and decentralized versions, and an empirical comparison with previous work using real recommendation datasets.

LGJul 4, 2025

Communication Efficient, Differentially Private Distributed Optimization using Correlation-Aware Sketching

Julien Nicolas, Mohamed Maouche, Sonia Ben Mokhtar et al.

Federated learning with differential privacy suffers from two major costs: each client must transmit $d$-dimensional gradients every round, and the magnitude of DP noise grows with $d$. Yet empirical studies show that gradient updates exhibit strong temporal correlations and lie in a $k$-dimensional subspace with $k \ll d$. Motivated by this, we introduce DOME, a decentralized DP optimization framework in which each client maintains a compact sketch to project gradients into $\mathbb{R}^k$ before privatization and Secure Aggregation. This reduces per-round communication from order $d$ to order $k$ and moves towards a gradient approximation mean-squared error of $σ^2 k$. To allow the sketch to span new directions and prevent it from collapsing onto historical gradients, we augment it with random probes orthogonal to historical directions. We prove that our overall protocol satisfies $(ε,δ)$-Differential Privacy.

LGSep 5, 2025

On the Normalization of Confusion Matrices: Methods and Geometric Interpretations

Johan Erbani, Pierre-Edouard Portier, Elod Egyed-Zsigmond et al.

The confusion matrix is a standard tool for evaluating classifiers by providing insights into class-level errors. In heterogeneous settings, its values are shaped by two main factors: class similarity -- how easily the model confuses two classes -- and distribution bias, arising from skewed distributions in the training and test sets. However, confusion matrix values reflect a mix of both factors, making it difficult to disentangle their individual contributions. To address this, we introduce bistochastic normalization using Iterative Proportional Fitting, a generalization of row and column normalization. Unlike standard normalizations, this method recovers the underlying structure of class similarity. By disentangling error sources, it enables more accurate diagnosis of model behavior and supports more targeted improvements. We also show a correspondence between confusion matrix normalizations and the model's internal class representations. Both standard and bistochastic normalizations can be interpreted geometrically in this space, offering a deeper understanding of what normalization reveals about a classifier.

LGJun 11, 2025

A Weighted Loss Approach to Robust Federated Learning under Data Heterogeneity

Johan Erbani, Sonia Ben Mokhtar, Pierre-Edouard Portier et al.

Federated learning (FL) is a machine learning paradigm that enables multiple data holders to collaboratively train a machine learning model without sharing their training data with external parties. In this paradigm, workers locally update a model and share with a central server their updated gradients (or model parameters). While FL seems appealing from a privacy perspective, it opens a number of threats from a security perspective as (Byzantine) participants can contribute poisonous gradients (or model parameters) harming model convergence. Byzantine-resilient FL addresses this issue by ensuring that the training proceeds as if Byzantine participants were absent. Towards this purpose, common strategies ignore outlier gradients during model aggregation, assuming that Byzantine gradients deviate more from honest gradients than honest gradients do from each other. However, in heterogeneous settings, honest gradients may differ significantly, making it difficult to distinguish honest outliers from Byzantine ones. In this paper, we introduce the Worker Label Alignement Loss (WoLA), a weighted loss that aligns honest worker gradients despite data heterogeneity, which facilitates the identification of Byzantines' gradients. This approach significantly outperforms state-of-the-art methods in heterogeneous settings. In this paper, we provide both theoretical insights and empirical evidence of its effectiveness.

CRJun 4, 2025

Dropout-Robust Mechanisms for Differentially Private and Fully Decentralized Mean Estimation

César Sabater, Sonia Ben Mokhtar, Jan Ramon

Achieving differentially private computations in decentralized settings poses significant challenges, particularly regarding accuracy, communication cost, and robustness against information leakage. While cryptographic solutions offer promise, they often suffer from high communication overhead or require centralization in the presence of network failures. Conversely, existing fully decentralized approaches typically rely on relaxed adversarial models or pairwise noise cancellation, the latter suffering from substantial accuracy degradation if parties unexpectedly disconnect. In this work, we propose IncA, a new protocol for fully decentralized mean estimation, a widely used primitive in data-intensive processing. Our protocol, which enforces differential privacy, requires no central orchestration and employs low-variance correlated noise, achieved by incrementally injecting sensitive information into the computation. First, we theoretically demonstrate that, when no parties permanently disconnect, our protocol achieves accuracy comparable to that of a centralized setting-already an improvement over most existing decentralized differentially private techniques. Second, we empirically show that our use of low-variance correlated noise significantly mitigates the accuracy loss experienced by existing techniques in the presence of dropouts.

LGApr 24, 2025

GRANITE : a Byzantine-Resilient Dynamic Gossip Learning Framework

Yacine Belal, Mohamed Maouche, Sonia Ben Mokhtar et al.

Gossip Learning (GL) is a decentralized learning paradigm where users iteratively exchange and aggregate models with a small set of neighboring peers. Recent GL approaches rely on dynamic communication graphs built and maintained using Random Peer Sampling (RPS) protocols. Thanks to graph dynamics, GL can achieve fast convergence even over extremely sparse topologies. However, the robustness of GL over dy- namic graphs to Byzantine (model poisoning) attacks remains unaddressed especially when Byzantine nodes attack the RPS protocol to scale up model poisoning. We address this issue by introducing GRANITE, a framework for robust learning over sparse, dynamic graphs in the presence of a fraction of Byzantine nodes. GRANITE relies on two key components (i) a History-aware Byzantine-resilient Peer Sampling protocol (HaPS), which tracks previously encountered identifiers to reduce adversarial influence over time, and (ii) an Adaptive Probabilistic Threshold (APT), which leverages an estimate of Byzantine presence to set aggregation thresholds with formal guarantees. Empirical results confirm that GRANITE maintains convergence with up to 30% Byzantine nodes, improves learning speed via adaptive filtering of poisoned models and obtains these results in up to 9 times sparser graphs than dictated by current theory.

LGDec 17, 2024

Exposing the Vulnerability of Decentralized Learning to Membership Inference Attacks Through the Lens of Graph Mixing

Ousmane Touat, Jezekael Brunon, Yacine Belal et al.

The primary promise of decentralized learning is to allow users to engage in the training of machine learning models in a collaborative manner while keeping their data on their premises and without relying on any central entity. However, this paradigm necessitates the exchange of model parameters or gradients between peers. Such exchanges can be exploited to infer sensitive information about training data, which is achieved through privacy attacks (e.g., Membership Inference Attacks -- MIA). In order to devise effective defense mechanisms, it is important to understand the factors that increase/reduce the vulnerability of a given decentralized learning architecture to MIA. In this study, we extensively explore the vulnerability to MIA of various decentralized learning architectures by varying the graph structure (e.g., number of neighbors), the graph dynamics, and the aggregation strategy, across diverse datasets and data distributions. Our key finding, which to the best of our knowledge we are the first to report, is that the vulnerability to MIA is heavily correlated to (i) the local model mixing strategy performed by each node upon reception of models from neighboring nodes and (ii) the global mixing properties of the communication graph. We illustrate these results experimentally using four datasets and by theoretically analyzing the mixing properties of various decentralized architectures. We also empirically show that enhancing mixing properties is highly beneficial when combined with other privacy-preserving techniques such as Differential Privacy. Our paper draws a set of lessons learned for devising decentralized learning systems that reduce by design the vulnerability to MIA.

LGMay 9, 2023

Survey of Federated Learning Models for Spatial-Temporal Mobility Applications

Yacine Belal, Sonia Ben Mokhtar, Hamed Haddadi et al.

Federated learning involves training statistical models over edge devices such as mobile phones such that the training data is kept local. Federated Learning (FL) can serve as an ideal candidate for training spatial temporal models that rely on heterogeneous and potentially massive numbers of participants while preserving the privacy of highly sensitive location data. However, there are unique challenges involved with transitioning existing spatial temporal models to decentralized learning. In this survey paper, we review the existing literature that has proposed FL-based models for predicting human mobility, traffic prediction, community detection, location-based recommendation systems, and other spatial-temporal tasks. We describe the metrics and datasets these works have been using and create a baseline of these approaches in comparison to the centralized settings. Finally, we discuss the challenges of applying spatial-temporal models in a decentralized setting and by highlighting the gaps in the literature we provide a road map and opportunities for the research community.

LGMar 19, 2021

Enhancing Robustness of On-line Learning Models on Highly Noisy Data

Zilong Zhao, Robert Birke, Rui Han et al.

Classification algorithms have been widely adopted to detect anomalies for various systems, e.g., IoT, cloud and face recognition, under the common assumption that the data source is clean, i.e., features and labels are correctly set. However, data collected from the wild can be unreliable due to careless annotations or malicious data transformation for incorrect anomaly detection. In this paper, we extend a two-layer on-line data selection framework: Robust Anomaly Detector (RAD) with a newly designed ensemble prediction where both layers contribute to the final anomaly detection decision. To adapt to the on-line nature of anomaly detection, we consider additional features of conflicting opinions of classifiers, repetitive cleaning, and oracle knowledge. We on-line learn from incoming data streams and continuously cleanse the data, so as to adapt to the increasing learning capacity from the larger accumulated data set. Moreover, we explore the concept of oracle learning that provides additional information of true labels for difficult data points. We specifically focus on three use cases, (i) detecting 10 classes of IoT attacks, (ii) predicting 4 classes of task failures of big data jobs, and (iii) recognising 100 celebrities faces. Our evaluation results show that RAD can robustly improve the accuracy of anomaly detection, to reach up to 98.95% for IoT device attacks (i.e., +7%), up to 85.03% for cloud task failures (i.e., +14%) under 40% label noise, and for its extension, it can reach up to 77.51% for face recognition (i.e., +39%) under 30% label noise. The proposed RAD and its extensions are general and can be applied to different anomaly detection algorithms.

LGNov 11, 2019

RAD: On-line Anomaly Detection for Highly Unreliable Data

Zilong Zhao, Robert Birke, Rui Han et al.

Classification algorithms have been widely adopted to detect anomalies for various systems, e.g., IoT, cloud and face recognition, under the common assumption that the data source is clean, i.e., features and labels are correctly set. However, data collected from the wild can be unreliable due to careless annotations or malicious data transformation for incorrect anomaly detection. In this paper, we present a two-layer on-line learning framework for robust anomaly detection (RAD) in the presence of unreliable anomaly labels, where the first layer is to filter out the suspicious data, and the second layer detects the anomaly patterns from the remaining data. To adapt to the on-line nature of anomaly detection, we extend RAD with additional features of repetitively cleaning, conflicting opinions of classifiers, and oracle knowledge. We on-line learn from the incoming data streams and continuously cleanse the data, so as to adapt to the increasing learning capacity from the larger accumulated data set. Moreover, we explore the concept of oracle learning that provides additional information of true labels for difficult data points. We specifically focus on three use cases, (i) detecting 10 classes of IoT attacks, (ii) predicting 4 classes of task failures of big data jobs, (iii) recognising 20 celebrities faces. Our evaluation results show that RAD can robustly improve the accuracy of anomaly detection, to reach up to 98% for IoT device attacks (i.e., +11%), up to 84% for cloud task failures (i.e., +20%) under 40% noise, and up to 74% for face recognition (i.e., +28%) under 30% noisy labels. The proposed RAD is general and can be applied to different anomaly detection algorithms.

DCMay 4, 2018

X-Search: Revisiting Private Web Search using Intel SGX

Sonia Ben Mokhtar, Antoine Boutet, Pascal Felber et al.

The exploitation of user search queries by search engines is at the heart of their economic model. As consequence, offering private Web search functionalities is essential to the users who care about their privacy. Nowadays, there exists no satisfactory approach to enable users to access search engines in a privacy-preserving way. Existing solutions are either too costly due to the heavy use of cryptographic mechanisms (e.g., private information retrieval protocols), subject to attacks (e.g., Tor, TrackMeNot, GooPIR) or rely on weak adversarial models (e.g., PEAS). This paper introduces X-Search , a novel private Web search mechanism building on the disruptive Software Guard Extensions (SGX) proposed by Intel. We compare X-Search to its closest competitors, Tor and PEAS, using a dataset of real web search queries. Our evaluation shows that: (1) X-Search offers stronger privacy guarantees than its competitors as it operates under a stronger adversarial model; (2) it better resists state-of-the-art re-identification attacks; and (3) from the performance perspective, X-Search outperforms its competitors both in terms of latency and throughput by orders of magnitude.

DCMay 3, 2018

CYCLOSA: Decentralizing Private Web Search Through SGX-Based Browser Extensions

Rafael Pires, David Goltzsche, Sonia Ben Mokhtar et al.

By regularly querying Web search engines, users (unconsciously) disclose large amounts of their personal data as part of their search queries, among which some might reveal sensitive information (e.g. health issues, sexual, political or religious preferences). Several solutions exist to allow users querying search engines while improving privacy protection. However, these solutions suffer from a number of limitations: some are subject to user re-identification attacks, while others lack scalability or are unable to provide accurate results. This paper presents CYCLOSA, a secure, scalable and accurate private Web search solution. CYCLOSA improves security by relying on trusted execution environments (TEEs) as provided by Intel SGX. Further, CYCLOSA proposes a novel adaptive privacy protection solution that reduces the risk of user re- identification. CYCLOSA sends fake queries to the search engine and dynamically adapts their count according to the sensitivity of the user query. In addition, CYCLOSA meets scalability as it is fully decentralized, spreading the load for distributing fake queries among other nodes. Finally, CYCLOSA achieves accuracy of Web search as it handles the real query and the fake queries separately, in contrast to other existing solutions that mix fake and real query results.

CRSep 23, 2016

Adaptive Location Privacy with ALP

Vincent Primault, Antoine Boutet, Sonia Ben Mokhtar et al.

With the increasing amount of mobility data being collected on a daily basis by location-based services (LBSs) comes a new range of threats for users, related to the over-sharing of their location information. To deal with this issue, several location privacy protection mechanisms (LPPMs) have been proposed in the past years. However, each of these mechanisms comes with different configuration parameters that have a direct impact both on the privacy guarantees offered to the users and on the resulting utility of the protected data. In this context, it can be difficult for non-expert system designers to choose the appropriate configuration to use. Moreover, these mechanisms are generally configured once for all, which results in the same configuration for every protected piece of information. However, not all users have the same behaviour, and even the behaviour of a single user is likely to change over time. To address this issue, we present in this paper ALP, a new framework enabling the dynamic configuration of LPPMs. ALP can be used in two scenarios: (1) offline, where ALP enables a system designer to choose and automatically tune the most appropriate LPPM for the protection of a given dataset; (2) online, where ALP enables the user of a crowd sensing application to protect consecutive batches of her geolocated data by automatically tuning an existing LPPM to fulfil a set of privacy and utility objectives. We evaluate ALP on both scenarios with two real-life mobility datasets and two state-of-the-art LPPMs. Our experiments show that the adaptive LPPM configurations found by ALP outperform both in terms of privacy and utility a set of static configurations manually fixed by a system designer.

CRJul 2, 2015

Time Distortion Anonymization for the Publication of Mobility Data with High Utility

Vincent Primault, Sonia Ben Mokhtar, Cédric Lauradoux et al.

An increasing amount of mobility data is being collected every day by different means, such as mobile applications or crowd-sensing campaigns. This data is sometimes published after the application of simple anonymization techniques (e.g., putting an identifier instead of the users' names), which might lead to severe threats to the privacy of the participating users. Literature contains more sophisticated anonymization techniques, often based on adding noise to the spatial data. However, these techniques either compromise the privacy if the added noise is too little or the utility of the data if the added noise is too strong. We investigate in this paper an alternative solution, which builds on time distortion instead of spatial distortion. Specifically, our contribution lies in (1) the introduction of the concept of time distortion to anonymize mobility datasets (2) Promesse, a protection mechanism implementing this concept (3) a practical study of Promesse compared to two representative spatial distortion mechanisms, namely Wait For Me, which enforces k-anonymity, and Geo-Indistinguishability, which enforces differential privacy. We evaluate our mechanism practically using three real-life datasets. Our results show that time distortion reduces the number of points of interest that can be retrieved by an adversary to under 3 %, while the introduced spatial error is almost null and the distortion introduced on the results of range queries is kept under 13 % on average.

CRJun 30, 2015

Privacy-preserving Publication of Mobility Data with High Utility

Vincent Primault, Sonia Ben Mokhtar, Lionel Brunie

An increasing amount of mobility data is being collected every day by different means, e.g., by mobile phone operators. This data is sometimes published after the application of simple anonymization techniques, which might lead to severe privacy threats. We propose in this paper a new solution whose novelty is twofold. Firstly, we introduce an algorithm designed to hide places where a user stops during her journey (namely points of interest), by enforcing a constant speed along her trajectory. Secondly, we leverage places where users meet to take a chance to swap their trajectories and therefore confuse an attacker.

CROct 28, 2014

Differentially Private Location Privacy in Practice

Vincent Primault, Sonia Ben Mokhtar, Cedric Lauradoux et al.

With the wide adoption of handheld devices (e.g. smartphones, tablets) a large number of location-based services (also called LBSs) have flourished providing mobile users with real-time and contextual information on the move. Accounting for the amount of location information they are given by users, these services are able to track users wherever they go and to learn sensitive information about them (e.g. their points of interest including home, work, religious or political places regularly visited). A number of solutions have been proposed in the past few years to protect users location information while still allowing them to enjoy geo-located services. Among the most robust solutions are those that apply the popular notion of differential privacy to location privacy (e.g. Geo-Indistinguishability), promising strong theoretical privacy guarantees with a bounded accuracy loss. While these theoretical guarantees are attracting, it might be difficult for end users or practitioners to assess their effectiveness in the wild. In this paper, we carry on a practical study using real mobility traces coming from two different datasets, to assess the ability of Geo-Indistinguishability to protect users' points of interest (POIs). We show that a curious LBS collecting obfuscated location information sent by mobile users is still able to infer most of the users POIs with a reasonable both geographic and semantic precision. This precision depends on the degree of obfuscation applied by Geo-Indistinguishability. Nevertheless, the latter also has an impact on the overhead incurred on mobile devices resulting in a privacy versus overhead trade-off. Finally, we show in our study that POIs constitute a quasi-identifier for mobile users and that obfuscating them using Geo-Indistinguishability is not sufficient as an attacker is able to re-identify at least 63% of them despite a high degree of obfuscation.