LG | Oct 30, 2023
PriPrune: Quantifying and Preserving Privacy in Pruned Federated Learning
Tianyue Chu, Mengwei Yang, Nikolaos Laoutaris et al.
Federated learning (FL) is a paradigm that allows several client devices and a server to collaboratively train a global model, by exchanging only model updates, without the devices sharing their local training data. These devices are often constrained in terms of communication and computation resources, and can further benefit from model pruning -- a paradigm that is widely used to reduce the size and complexity of models. Intuitively, by making local models coarser, pruning is expected to also provide some protection against privacy attacks in the context of FL. However, this protection has not been previously characterized, formally or experimentally, and it is unclear if it is sufficient against state-of-the-art attacks. In this paper, we perform the first investigation of privacy guarantees for model pruning in FL. We derive information-theoretic upper bounds on the amount of information leaked by pruned FL models. We complement and validate these theoretical findings with comprehensive experiments that involve state-of-the-art privacy attacks, on several state-of-the-art FL pruning schemes, using benchmark datasets. This evaluation provides valuable insights into the choices and parameters that can affect the privacy protection provided by pruning. Based on these insights, we introduce PriPrune -- a privacy-aware algorithm for local model pruning, which uses a personalized per-client defense mask and adapts the defense pruning rate so as to jointly optimize privacy and model performance. PriPrune is universal in that it can be applied after any pruned FL scheme on the client, without modification, and protects against any inversion attack by the server. Our empirical evaluation demonstrates that PriPrune significantly improves the privacy-accuracy tradeoff compared to state-of-the-art pruned FL schemes that do not take privacy into account.
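To make the terms "mask" and "pruning rate" concrete, here is a minimal sketch of generic magnitude-based pruning on a flat list of weights. It is not the PriPrune algorithm, whose defense mask and rate adaptation are not specified in the abstract; the function name `magnitude_prune` and the sample values are assumptions for illustration only.

```python
def magnitude_prune(weights, rate):
    """Zero out the `rate` fraction of weights with the smallest magnitude.

    Returns (pruned_weights, mask), where mask[i] is 1 for kept weights
    and 0 for pruned ones. Generic magnitude pruning, used here only to
    illustrate what a pruning mask and pruning rate are.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be in [0, 1]")
    k = int(len(weights) * rate)  # number of weights to drop
    # Indices ordered by magnitude, smallest first; ties broken by index.
    order = sorted(range(len(weights)), key=lambda i: (abs(weights[i]), i))
    dropped = set(order[:k])
    mask = [0 if i in dropped else 1 for i in range(len(weights))]
    pruned = [w if m else 0.0 for w, m in zip(weights, mask)]
    return pruned, mask


# Hypothetical local update with four weights, pruned at a 50% rate.
pruned, mask = magnitude_prune([0.5, -0.1, 0.03, -2.0], rate=0.5)
```

A higher rate sends fewer non-zero values to the server, which is the intuition behind pruning offering some privacy protection; how much protection it actually gives is the question the paper studies.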
LG | May 21, 2024
Maverick-Aware Shapley Valuation for Client Selection in Federated Learning
Mengwei Yang, Ismat Jarin, Baturalp Buyukates et al.
Federated Learning (FL) allows clients to train a model collaboratively without sharing their private data. One key challenge in practical FL systems is data heterogeneity, particularly in handling clients with rare data, also referred to as Mavericks. These clients own one or more data classes exclusively, and the model performance becomes poor without their participation. Thus, utilizing Mavericks throughout training is crucial. In this paper, we first design a Maverick-aware Shapley valuation that fairly evaluates the contribution of Mavericks. The main idea is to compute the clients' Shapley values (SV) class-wise, i.e., per label. Next, we propose FedMS, a Maverick-Shapley client selection mechanism for FL that intelligently selects the clients that contribute the most in each round, by employing our Maverick-aware SV-based contribution score. We show that, compared to an extensive list of baselines, FedMS achieves better model performance and fairer Shapley Rewards distribution.
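The idea of computing Shapley values per label can be sketched on a toy example. The code below computes exact Shapley values by enumerating coalitions, once per class. The utility function (a class counts as "covered" if any coalition member holds it) is a deliberately simple stand-in chosen for illustration, not the utility used by FedMS, and all names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations
from math import factorial


def shapley_values(clients, utility):
    """Exact Shapley values by enumerating all coalitions (small n only)."""
    n = len(clients)
    values = {}
    for i in clients:
        others = [c for c in clients if c != i]
        total = Fraction(0)
        for size in range(n):
            # Weight of a coalition of this size in the Shapley formula.
            weight = Fraction(factorial(size) * factorial(n - size - 1),
                              factorial(n))
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (utility(s | {i}) - utility(s))
        values[i] = total
    return values


def classwise_shapley(client_classes, classes):
    """Compute Shapley values separately for each class label.

    Assumed toy utility: 1 if some coalition member holds the class, else 0.
    """
    clients = sorted(client_classes)
    result = {}
    for c in classes:
        def covers(coalition, c=c):
            return int(any(c in client_classes[m] for m in coalition))
        result[c] = shapley_values(clients, covers)
    return result


# Clients A and B share classes 0 and 1; C is a Maverick owning class 2 alone.
sv = classwise_shapley({"A": {0, 1}, "B": {0, 1}, "C": {2}}, classes=[0, 1, 2])
```

In this toy setting the Maverick receives the full value for its exclusive class, while A and B split the value of the classes they share, which is the kind of distinction a per-class valuation is meant to surface.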
LG | Feb 13, 2025
AutoLike: Auditing Social Media Recommendations through User Interactions
Hieu Le, Salma Elmalaki, Zubair Shafiq et al.
Modern social media platforms, such as TikTok, Facebook, and YouTube, rely on recommendation systems to personalize content for users based on user interactions with endless streams of content, such as "For You" pages. However, these complex algorithms can inadvertently deliver problematic content related to self-harm, mental health, and eating disorders. We introduce AutoLike, a framework to audit recommendation systems in social media platforms for topics of interest and their sentiments. To automate the process, we formulate the problem as a reinforcement learning problem. AutoLike drives the recommendation system to serve a particular type of content through interactions (e.g., liking). We apply the AutoLike framework to the TikTok platform as a case study. We evaluate how well AutoLike identifies TikTok content automatically across nine topics of interest, and we conduct eight experiments to demonstrate how well it drives TikTok's recommendation system towards particular topics and sentiments. AutoLike has the potential to assist regulators in auditing recommendation systems for problematic content. (Warning: This paper contains qualitative examples that may be viewed as offensive or harmful.)
LG | Feb 25, 2022
AutoFR: Automated Filter Rule Generation for Adblocking
Hieu Le, Salma Elmalaki, Athina Markopoulou et al.
Adblocking relies on filter lists, which are manually curated and maintained by a community of filter list authors. Filter list curation is a laborious process that does not scale well to a large number of sites or over time. In this paper, we introduce AutoFR, a reinforcement learning framework to fully automate the process of filter rule creation and evaluation for sites of interest. We design an algorithm based on multi-arm bandits to generate filter rules that block ads while controlling the trade-off between blocking ads and avoiding visual breakage. We test AutoFR on thousands of sites and we show that it is efficient: it takes only a few minutes to generate filter rules for a site of interest. AutoFR is effective: it generates filter rules that can block 86% of the ads, as compared to 87% by EasyList, while achieving comparable visual breakage. Furthermore, AutoFR generates filter rules that generalize well to new sites. We envision that AutoFR can assist the adblocking community in filter rule generation at scale.
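The bandit formulation can be illustrated with a small sketch: each candidate filter rule is an arm, and its reward trades off the fraction of ads blocked against visual breakage. The code uses the standard UCB1 index as a stand-in; the abstract does not say which bandit algorithm AutoFR uses. The reward formula, the weight `w`, the rule strings, and the per-rule measurements are all hypothetical, and rewards are kept deterministic so the example is reproducible.

```python
import math


def rule_reward(ads_blocked, breakage, w=1.0):
    """Assumed reward: fraction of ads blocked minus weighted breakage."""
    return ads_blocked - w * breakage


def ucb1(rewards_by_arm, rounds):
    """Run UCB1 over arms with fixed rewards; return pull counts per arm."""
    arms = list(rewards_by_arm)
    pulls = {a: 0 for a in arms}
    totals = {a: 0.0 for a in arms}
    for t in range(1, rounds + 1):
        untried = [a for a in arms if pulls[a] == 0]
        if untried:
            arm = untried[0]  # pull every arm once before using the index
        else:
            arm = max(
                arms,
                key=lambda a: totals[a] / pulls[a]
                + math.sqrt(2 * math.log(t) / pulls[a]),
            )
        pulls[arm] += 1
        totals[arm] += rewards_by_arm[arm]
    return pulls


# Hypothetical candidate rules with (ads blocked, breakage) measurements.
candidates = {
    "||ads.example^": rule_reward(0.90, 0.10),
    "||cdn.example^": rule_reward(0.95, 0.65),  # blocks ads but breaks the page
    "||tracker.example^": rule_reward(0.20, 0.00),
}
pulls = ucb1(candidates, rounds=200)
best_rule = max(pulls, key=pulls.get)
```

Raising `w` penalizes breakage more heavily, which is one simple way to expose the blocking-versus-breakage trade-off as a tunable parameter.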
LG | Feb 8, 2022
A Unified Prediction Framework for Signal Maps
Emmanouil Alimpertis, Athina Markopoulou, Carter T. Butts et al.
Signal maps are essential for the planning and operation of cellular networks. However, the measurements needed to create such maps are expensive, are often biased, do not always reflect the metrics of interest, and pose privacy risks. In this paper, we develop a unified framework for predicting cellular signal maps from limited measurements. Our framework builds on a state-of-the-art random-forest predictor, or any other base predictor. We propose and combine three mechanisms that deal with the fact that not all measurements are equally important for a particular prediction task. First, we design quality-of-service functions ($Q$), including signal strength (RSRP) but also other metrics of interest to operators, i.e., coverage and call drop probability. By implicitly altering the loss function employed in learning, quality functions can also improve prediction for RSRP itself where it matters (e.g., MSE reduction up to 27% in the low signal strength regime, where errors are critical). Second, we introduce weight functions ($W$) to specify the relative importance of prediction at different locations and other parts of the feature space. We propose re-weighting based on importance sampling to obtain unbiased estimators when the sampling and target distributions are different. This yields improvements up to 20% for targets based on spatially uniform loss or losses based on user population density. Third, we apply the Data Shapley framework for the first time in this context to assign values ($φ$) to individual measurement points, which capture the importance of their contribution to the prediction task. This improves prediction (e.g., from 64% to 94% in recall for coverage loss) by removing points with negative values, and can also enable data minimization. We evaluate our methods and demonstrate significant improvement in prediction performance, using several real-world datasets.
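The re-weighting step can be sketched with the textbook importance-sampling estimator: each measurement is weighted by the ratio of the target density to the sampling density at its location. The two-region setup, the distributions, and the per-sample losses below are invented for illustration and are not taken from the paper.

```python
def importance_weighted_mean(losses, regions, sample_dist, target_dist):
    """Estimate the expected loss under `target_dist` from samples drawn
    according to `sample_dist`, using weights target / sample."""
    weights = [target_dist[r] / sample_dist[r] for r in regions]
    return sum(w * l for w, l in zip(weights, losses)) / len(losses)


# Hypothetical data: 8 urban measurements (loss 1.0) and 2 rural ones
# (loss 3.0), although the target is spatially uniform across both regions.
regions = ["urban"] * 8 + ["rural"] * 2
losses = [1.0] * 8 + [3.0] * 2
sample_dist = {"urban": 0.8, "rural": 0.2}
target_dist = {"urban": 0.5, "rural": 0.5}

naive = sum(losses) / len(losses)
reweighted = importance_weighted_mean(losses, regions, sample_dist, target_dist)
```

Here the naive average is about 1.4 because urban points dominate the sample, while the re-weighted estimate is about 2.0, the loss a spatially uniform target would assign.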
LG | Dec 7, 2021
Location Leakage in Federated Signal Maps
Evita Bakopoulou, Mengwei Yang, Jiang Zhang et al.
We consider the problem of predicting cellular network performance (signal maps) from measurements collected by several mobile devices. We formulate the problem within the online federated learning framework: (i) federated learning (FL) enables users to collaboratively train a model, while keeping their training data on their devices; (ii) measurements are collected as users move around over time and are used for local training in an online fashion. We consider an honest-but-curious server, who observes the updates from target users participating in FL and infers their location using a deep leakage from gradients (DLG) type of attack, originally developed to reconstruct training data of DNN image classifiers. We make the key observation that a DLG attack, applied to our setting, infers the average location of a batch of local data, and can thus be used to reconstruct the target users' trajectory at a coarse granularity. We build on this observation to protect location privacy, in our setting, by revisiting and designing mechanisms within the federated learning framework including: tuning the FL parameters for averaging, curating local batches so as to mislead the DLG attacker, and aggregating across multiple users with different trajectories. We evaluate the performance of our algorithms through both analysis and simulation based on real-world mobile datasets, and we show that they achieve a good privacy-utility tradeoff.
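One way to build intuition for why gradients can reveal an average location is a toy linear model, which is far simpler than the DNN setting and the iterative DLG attack studied in the paper. For a model `w.x + b` with squared loss, the gradient with respect to `w` is the batch mean of `residual * x` and the gradient with respect to `b` is the batch mean of the residuals, so their ratio is a residual-weighted average of the inputs. All names and values below are assumptions for illustration.

```python
def linear_batch_gradients(batch, targets, w, b):
    """Gradients of the mean of 0.5 * (w.x + b - y)^2 over a batch."""
    n = len(batch)
    residuals = [
        sum(wi * xi for wi, xi in zip(w, x)) + b - y
        for x, y in zip(batch, targets)
    ]
    grad_w = [
        sum(r * x[d] for r, x in zip(residuals, batch)) / n
        for d in range(len(w))
    ]
    grad_b = sum(residuals) / n
    return grad_w, grad_b


def inferred_location(grad_w, grad_b):
    """Residual-weighted average input recovered from the gradients.

    Only defined when grad_b is non-zero.
    """
    return [g / grad_b for g in grad_w]


# Hypothetical batch of (x, y) coordinates with equal signal-strength targets.
batch = [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)]
targets = [-80.0, -80.0, -80.0]
grad_w, grad_b = linear_batch_gradients(batch, targets, w=[0.0, 0.0], b=0.0)
estimate = inferred_location(grad_w, grad_b)
```

When all residuals are equal, as in this example, the estimate is exactly the mean of the batch locations; with unequal residuals it is a weighted average, which suggests why changing how local batches are composed can affect what an observer of the updates learns.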
CR | Jun 9, 2021
OVRseen: Auditing Network Traffic and Privacy Policies in Oculus VR
Rahmadi Trimananda, Hieu Le, Hao Cui et al.
Virtual reality (VR) is an emerging technology that enables new applications but also introduces privacy risks. In this paper, we focus on Oculus VR (OVR), the leading platform in the VR space, and we provide the first comprehensive analysis of personal data exposed by OVR apps and the platform itself, from a combined networking and privacy policy perspective. We experimented with the Quest 2 headset and tested the most popular VR apps available on the official Oculus and the SideQuest app stores. We developed OVRseen, a methodology and system for collecting, analyzing, and comparing network traffic and privacy policies on OVR. On the networking side, we captured and decrypted network traffic of VR apps, which was previously not possible on OVR, and we extracted data flows, defined as <app, data type, destination>. Compared to the mobile and other app ecosystems, we found OVR to be more centralized and driven by tracking and analytics, rather than by third-party advertising. We show that the data types exposed by VR apps include personally identifiable information (PII), device information that can be used for fingerprinting, and VR-specific data types. By comparing the data flows found in the network traffic with statements made in the apps' privacy policies, we found that approximately 70% of OVR data flows were not properly disclosed. Furthermore, we extracted additional context from the privacy policies, and we observed that 69% of the data flows were used for purposes unrelated to the core functionality of apps.
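The data-flow representation lends itself to a simple set-based sketch. Below, each flow is an `(app, data type, destination)` tuple, and a flow counts as disclosed if the app's policy mentions that data type. This is a strong simplification of any real policy analysis, and the app names, data types, and destinations are hypothetical.

```python
def undisclosed_flows(flows, disclosed):
    """Return flows whose (app, data type) pair is absent from `disclosed`."""
    return {f for f in flows if (f[0], f[1]) not in disclosed}


# Hypothetical flows observed in decrypted traffic.
flows = {
    ("AppA", "device_id", "analytics.example"),
    ("AppA", "email", "api.appa.example"),
    ("AppB", "play_area", "tracker.example"),
    ("AppB", "device_id", "tracker.example"),
}
# Hypothetical (app, data type) pairs extracted from privacy policies.
disclosed = {("AppA", "email"), ("AppB", "device_id")}

missing = undisclosed_flows(flows, disclosed)
fraction_undisclosed = len(missing) / len(flows)
```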
CR | Aug 19, 2020
Exposures Exposed: A Measurement and User Study to Assess Mobile Data Privacy in Context
Evita Bakopoulou, Anastasia Shuba, Athina Markopoulou
Mobile devices have access to personal, potentially sensitive data, and there is a large number of mobile applications and third-party libraries that transmit this information over the network to remote servers (including app developer servers and third party servers). In this paper, we are interested in better understanding not just the extent of personally identifiable information (PII) exposure, but also its context (i.e., functionality of the app, destination server, encryption used, etc.) and the risk perceived by mobile users today. To that end, we take two steps. First, we perform a measurement study: we collect a new dataset via manual and automatic testing and capture the exposure of 16 PII types from the 400 most popular Android apps. We analyze these exposures and provide insights into the extent and patterns of mobile apps sharing PII, which can be later used for prediction and prevention. Second, we perform a user study with 220 participants on Amazon Mechanical Turk: we summarize the results of the measurement study in categories, present them in a realistic context, and assess users' understanding, concern, and willingness to take action. To the best of our knowledge, our user study is the first to collect and analyze user input in such fine granularity and on actual (not just potential or permitted) privacy exposures on mobile devices. Although many users did not initially understand the full implications of their PII being exposed, after being better informed through the study, they became appreciative and interested in better privacy practices.
LG | Jul 30, 2019
A Federated Learning Approach for Mobile Packet Classification
Evita Bakopoulou, Balint Tillman, Athina Markopoulou
In order to improve mobile data transparency, a number of network-based approaches have been proposed to inspect packets generated by mobile devices and detect personally identifiable information (PII), ad requests, or other activities. State-of-the-art approaches train classifiers based on features extracted from HTTP packets. So far, these classifiers have only been trained in a centralized way, where mobile users label and upload their packet logs to a central server, which then trains a global classifier and shares it with the users to apply on their devices. However, packet logs used as training data may contain sensitive information that users may not want to share/upload. In this paper, we apply, for the first time, a Federated Learning approach to mobile packet classification, which allows mobile devices to collaborate and train a global model, without sharing raw training data. Methodological challenges we address in this context include: model and feature selection, and tuning the Federated Learning parameters. We apply our framework to two different packet classification tasks (i.e., to predict PII exposure or ad requests in HTTP packets) and we demonstrate its effectiveness in terms of classification performance, communication and computation cost, using three real-world datasets.
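The aggregation step that lets devices train a global model without uploading raw packet logs can be sketched as a sample-weighted average of client weight vectors, in the style of federated averaging. The abstract does not specify the aggregation rule used in the paper, so this is an assumption, and the client sizes and weights are invented.

```python
def federated_average(client_updates):
    """Combine client models into a global model.

    client_updates: list of (num_samples, weights) pairs, where all
    weight vectors have the same length. Each client contributes in
    proportion to the number of local samples it trained on.
    """
    if not client_updates:
        raise ValueError("need at least one client update")
    total = sum(n for n, _ in client_updates)
    dim = len(client_updates[0][1])
    return [
        sum(n * w[i] for n, w in client_updates) / total
        for i in range(dim)
    ]


# Two hypothetical clients: one with 10 labeled packets, one with 30.
global_weights = federated_average([(10, [1.0, 2.0]), (30, [3.0, 6.0])])
```

Only the weight vectors and sample counts leave the device in this scheme; the labeled packets themselves stay local.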
NI | Jul 26, 2019
PingPong: Packet-Level Signatures for Smart Home Device Events
Rahmadi Trimananda, Janus Varmarken, Athina Markopoulou et al.
Smart home devices are vulnerable to passive inference attacks based on network traffic, even in the presence of encryption. In this paper, we present PINGPONG, a tool that can automatically extract packet-level signatures for device events (e.g., light bulb turning ON/OFF) from network traffic. We evaluated PINGPONG on popular smart home devices ranging from smart plugs and thermostats to cameras, voice-activated devices, and smart TVs. We were able to: (1) automatically extract previously unknown signatures that consist of simple sequences of packet lengths and directions; (2) use those signatures to detect the devices or specific events with an average recall of more than 97%; (3) show that the signatures are unique among hundreds of millions of packets of real world network traffic; (4) show that our methodology is also applicable to publicly available datasets; and (5) demonstrate its robustness in different settings: events triggered by local and remote smartphones, as well as by home automation systems.
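Since the signatures are described as simple sequences of packet lengths and directions, detection can be sketched as searching a trace for a contiguous run of matching `(length, direction)` pairs. This is a simplified matcher written for illustration, not the PINGPONG implementation; the packet lengths and the "toggle ON" signature are made up.

```python
def find_signature(trace, signature):
    """Return start indices where `signature` occurs contiguously in `trace`.

    Both arguments are lists of (packet_length, direction) tuples.
    """
    m = len(signature)
    if m == 0:
        raise ValueError("signature must not be empty")
    return [
        i for i in range(len(trace) - m + 1)
        if trace[i:i + m] == signature
    ]


# Hypothetical signature for a smart plug "toggle ON" event.
toggle_on = [(556, "C->S"), (1293, "S->C")]

# Hypothetical encrypted traffic: only lengths and directions are visible.
trace = [
    (90, "C->S"), (556, "C->S"), (1293, "S->C"),
    (66, "S->C"), (556, "C->S"), (1293, "S->C"),
]
matches = find_signature(trace, toggle_on)
```

The matcher never inspects payloads, which is why this kind of inference remains possible even when traffic is encrypted.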
NI | Mar 3, 2018
AntShield: On-Device Detection of Personal Information Exposure
Anastasia Shuba, Evita Bakopoulou, Milad Asgari Mehrabadi et al.
Mobile devices have access to personal, potentially sensitive data, and there is a growing number of applications that transmit this personally identifiable information (PII) over the network. In this paper, we present the AntShield system that performs on-device packet-level monitoring and detects the transmission of such sensitive information accurately and in real-time. A key insight is to distinguish PII that is predefined and is easily available on the device from PII that is unknown a priori but can be automatically detected by classifiers. Our system not only combines, for the first time, the advantages of on-device monitoring with the power of learning unknown PII, but also outperforms either of the two approaches alone. We demonstrate the real-time performance of our prototype as well as the classification performance using a dataset that we collect and analyze from scratch (including new findings in terms of leaks and patterns). AntShield is a first step towards enabling distributed learning of private information exposure.
NI | May 14, 2014
MicroCast: Cooperative Video Streaming using Cellular and D2D Connections
Anh Le, Lorenzo Keller, Hulya Seferoglu et al.
We consider a group of mobile users, within proximity of each other, who are interested in watching the same online video at roughly the same time. The common practice today is that each user downloads the video independently on her mobile device using her own cellular connection, which wastes access bandwidth and may also lead to poor video quality. We propose a novel cooperative system where each mobile device simultaneously uses two network interfaces: (i) the cellular to connect to the video server and download parts of the video and (ii) WiFi to connect locally to all other devices in the group and exchange those parts. Devices cooperate to efficiently utilize all network resources and are able to adapt to varying wireless network conditions. In the local WiFi network, we exploit overhearing, and we further combine it with network coding. The end result is savings in cellular bandwidth and improved user experience (faster download) by a factor of up to the order of the group size. We follow a complete approach, from theory to practice. First, we formulate the problem using a network utility maximization (NUM) framework, decompose the problem, and provide a distributed solution. Then, based on the structure of the NUM solution, we design a modular system called MicroCast and we implement it as an Android application. We provide both simulation results of the NUM solution and experimental evaluation of MicroCast on a testbed consisting of Android phones. We demonstrate that the proposed approach brings significant performance benefits without battery penalty.
CR | Mar 8, 2012
Auditing for Distributed Storage Systems
Anh Le, Athina Markopoulou, Alexandros G. Dimakis
Distributed storage codes have recently received a lot of attention in the community. Independently, another body of work has proposed integrity checking schemes for cloud storage, none of which, however, is customized for coding-based storage or can efficiently support repair. In this work, we bridge the gap between these two currently disconnected bodies of work. We propose NC-Audit, a novel cryptography-based remote data integrity checking scheme, designed specifically for network coding-based distributed storage systems. NC-Audit combines, for the first time, the following desired properties: (i) efficient checking of data integrity, (ii) efficient support for repairing failed nodes, and (iii) protection against information leakage when checking is performed by a third party. The key ingredient of the design of NC-Audit is a novel combination of SpaceMac, a homomorphic message authentication code (MAC) scheme for network coding, and NCrypt, a novel chosen-plaintext attack (CPA) secure encryption scheme that is compatible with SpaceMac. Our evaluation of a Java implementation of NC-Audit shows that an audit costs the storage node and the auditor a modest amount of computation time and lower bandwidth than prior work.
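The property that makes a MAC "homomorphic" for network coding can be illustrated with a toy inner-product tag over a small prime field: the tag of a linear combination of blocks equals the same linear combination of their tags, so coded or repaired blocks can be checked without re-tagging. This is not SpaceMac and is not secure as written (tiny modulus, fixed key, no per-block randomization); the key, modulus, and block values are arbitrary assumptions.

```python
P = 101  # small prime modulus, for illustration only


def tag(key, block):
    """Toy linear tag: inner product of key and block, modulo P."""
    return sum(k * x for k, x in zip(key, block)) % P


def combine(a, x, b, y):
    """Network-coding style linear combination a*x + b*y, modulo P."""
    return [(a * xi + b * yi) % P for xi, yi in zip(x, y)]


key = [3, 5, 7]
x = [1, 2, 3]
y = [4, 5, 6]
a, b = 2, 3

coded_block = combine(a, x, b, y)
tag_of_coded = tag(key, coded_block)
combined_tags = (a * tag(key, x) + b * tag(key, y)) % P
```

Because `tag_of_coded` equals `combined_tags`, a verifier holding the key can validate a coded block from the tags of its source blocks, which is the kind of compatibility with coding and repair that the abstract highlights.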