DSMay 27
Efficient Algorithms for Interdicting Facilities in Trees and Bounded Treewidth GraphsAli Abbasi, Eli Friedman, Leana Golubchik et al.
Given a graph $G$ of $n$ nodes partitioned into facilities and customers, the $r$-edge interdiction covering problem (REIC) is to remove up to $r$ edges so as to maximize the total weight of customers disconnected from all facilities, which is called the covering objective function. While REIC is known to be NP-complete for general graphs, Fröhlich and Ruzika show that the problem can be solved in polynomial time when $G$ is a tree, providing an $O(n^7 r)$-time algorithm. We give an efficient $O(nr^2)$-time dynamic programming algorithm for REIC on trees that is fixed-parameter linear in $n$. Evaluating our solution on a benchmark of randomly generated tree networks with baselines of the Fröhlich and Ruzika algorithm and the Gurobi integer program solver, we demonstrate that in practice, our algorithm is both significantly faster and less sensitive to network topology and size. We extend our algorithm for REIC to graphs of bounded treewidth, a well-studied family of sparse graphs that generalizes trees, and obtain a matching runtime of $O(nr^2)$. We also consider the $r$-facility interdiction covering problem (RFIC), a novel variant of this network interdiction problem where the goal is to remove up to $r$ facilities to maximize the covering objective function over disconnected customers. We show that RFIC is NP-complete by observing it generalizes the small set bipartite vertex expansion problem (SSBVE), also known as the minimum $p$-union problem. We give an $O(nr^2)$-time algorithm for RFIC on trees, which also gives an $O(n^3)$-time algorithm for SSBVE on trees.
OCMar 3, 2023
Queue Scheduling with Adversarial Bandit LearningJiatai Huang, Leana Golubchik, Longbo Huang
In this paper, we study scheduling of a queueing system with zero knowledge of instantaneous network conditions. We consider a one-hop single-server queueing system consisting of $K$ queues, each with time-varying and non-stationary arrival and service rates. Our scheduling approach builds on an innovative combination of adversarial bandit learning and Lyapunov drift minimization, without knowledge of the instantaneous network state (the arrival and service rates) of each queue. We then present two novel algorithms \texttt{SoftMW} (SoftMaxWeight) and \texttt{SSMW} (Sliding-window SoftMaxWeight), both capable of stabilizing systems that can be stablized by some (possibly unknown) sequence of randomized policies whose time-variation satisfies a mild condition. We further generalize our results to the setting where arrivals and departures only have bounded moments instead of being deterministically bounded and propose \texttt{SoftMW+} and \texttt{SSMW+} that are capable of stabilizing the system. As a building block of our new algorithms, we also extend the classical \texttt{EXP3.S} (Auer et al., 2002) algorithm for multi-armed bandits to handle unboundedly large feedback signals, which can be of independent interest.
LGNov 4, 2023
Estimating Ground Reaction Forces from Inertial SensorsBowen Song, Marco Paolieri, Harper E. Stewart et al.
Objective: Our aim is to determine if data collected with inertial measurement units (IMUs) during steady-state running could be used to estimate ground reaction forces (GRFs) and to derive biomechanical variables (e.g., contact time, impulse, change in velocity) using lightweight machine-learning approaches. In contrast, state-of-the-art estimation using LSTMs suffers from prohibitive inference times on edge devices, requires expensive training and hyperparameter optimization, and results in black box models. Methods: We proposed a novel lightweight solution, SVD Embedding Regression (SER), using linear regression between SVD embeddings of IMU data and GRF data. We also compared lightweight solutions including SER and k-Nearest-Neighbors (KNN) regression with state-of-the-art LSTMs. Results: We performed extensive experiments to evaluate these techniques under multiple scenarios and combinations of IMU signals and quantified estimation errors for predicting GRFs and biomechanical variables. We did this using training data from different athletes, from the same athlete, or both, and we explored the use of acceleration and angular velocity data from sensors at different locations (sacrum and shanks). Conclusion: Our results illustrated that lightweight solutions such as SER and KNN can be similarly accurate or more accurate than LSTMs. The use of personal data reduced estimation errors of all methods, particularly for most biomechanical variables (as compared to GRFs); moreover, this gain was more pronounced in the lightweight methods. Significance: The study of GRFs is used to characterize the mechanical loading experienced by individuals in movements such as running, which is clinically applicable to identify athletes at risk for stress-related injuries.
PFOct 6, 2022
Inference Latency Prediction at the EdgeZhuojin Li, Marco Paolieri, Leana Golubchik
With the growing workload of inference tasks on mobile devices, state-of-the-art neural architectures (NAs) are typically designed through Neural Architecture Search (NAS) to identify NAs with good tradeoffs between accuracy and efficiency (e.g., latency). Since measuring the latency of a huge set of candidate architectures during NAS is not scalable, approaches are needed for predicting end-to-end inference latency on mobile devices. Such predictions are challenging due to hardware heterogeneity, optimizations applied by ML frameworks, and the diversity of neural architectures. Motivated by these challenges, in this paper, we first quantitatively assess characteristics of neural architectures and mobile devices that have significant effects on inference latency. Based on this assessment, we propose a latency prediction framework which addresses these challenges by developing operation-wise latency predictors, under a variety of settings and a number of hardware devices, with multi-core CPUs and GPUs, achieving high accuracy in end-to-end latency prediction, as shown by our comprehensive evaluations. To illustrate that our approach does not require expensive data collection, we also show that accurate predictions can be achieved on real-world NAs using only small amounts of profiling data.
CVOct 29, 2025
A Study on Inference Latency for Vision Transformers on Mobile DevicesZhuojin Li, Marco Paolieri, Leana Golubchik
Given the significant advances in machine learning techniques on mobile devices, particularly in the domain of computer vision, in this work we quantitatively study the performance characteristics of 190 real-world vision transformers (ViTs) on mobile devices. Through a comparison with 102 real-world convolutional neural networks (CNNs), we provide insights into the factors that influence the latency of ViT architectures on mobile devices. Based on these insights, we develop a dataset including measured latencies of 1000 synthetic ViTs with representative building blocks and state-of-the-art architectures from two machine learning frameworks and six mobile platforms. Using this dataset, we show that inference latency of new ViTs can be predicted with sufficient accuracy for real-world applications.
LGOct 24, 2025
Accelerating Mobile Inference through Fine-Grained CPU-GPU Co-ExecutionZhuojin Li, Marco Paolieri, Leana Golubchik
Deploying deep neural networks on mobile devices is increasingly important but remains challenging due to limited computing resources. On the other hand, their unified memory architecture and narrower gap between CPU and GPU performance provide an opportunity to reduce inference latency by assigning tasks to both CPU and GPU. The main obstacles for such collaborative execution are the significant synchronization overhead required to combine partial results, and the difficulty of predicting execution times of tasks assigned to CPU and GPU (due to the dynamic selection of implementations and parallelism level). To overcome these obstacles, we propose both a lightweight synchronization mechanism based on OpenCL fine-grained shared virtual memory (SVM) and machine learning models to accurately predict execution times. Notably, these models capture the performance characteristics of GPU kernels and account for their dispatch times. A comprehensive evaluation on four mobile platforms shows that our approach can quickly select CPU-GPU co-execution strategies achieving up to 1.89x speedup for linear layers and 1.75x speedup for convolutional layers (close to the achievable maximum values of 2.01x and 1.87x, respectively, found by exhaustive grid search on a Pixel~5 smartphone).
LGMar 31, 2021
Achieving Transparency Report Privacy in Linear TimeChien-Lun Chen, Leana Golubchik, Ranjan Pal
An accountable algorithmic transparency report (ATR) should ideally investigate the (a) transparency of the underlying algorithm, and (b) fairness of the algorithmic decisions, and at the same time preserve data subjects' privacy. However, a provably formal study of the impact to data subjects' privacy caused by the utility of releasing an ATR (that investigates transparency and fairness), is yet to be addressed in the literature. The far-fetched benefit of such a study lies in the methodical characterization of privacy-utility trade-offs for release of ATRs in public, and their consequential application-specific impact on the dimensions of society, politics, and economics. In this paper, we first investigate and demonstrate potential privacy hazards brought on by the deployment of transparency and fairness measures in released ATRs. To preserve data subjects' privacy, we then propose a linear-time optimal-privacy scheme, built upon standard linear fractional programming (LFP) theory, for announcing ATRs, subject to constraints controlling the tolerance of privacy perturbation on the utility of transparency schemes. Subsequently, we quantify the privacy-utility trade-offs induced by our scheme, and analyze the impact of privacy perturbation on fairness measures in ATRs. To the best of our knowledge, this is the first analytical work that simultaneously addresses trade-offs between the triad of privacy, utility, and fairness, applicable to algorithmic transparency reports.
LGJun 12, 2020
Backdoor Attacks on Federated Meta-LearningChien-Lun Chen, Leana Golubchik, Marco Paolieri
Federated learning allows multiple users to collaboratively train a shared classification model while preserving data privacy. This approach, where model updates are aggregated by a central server, was shown to be vulnerable to poisoning backdoor attacks: a malicious user can alter the shared model to arbitrarily classify specific inputs from a given class. In this paper, we analyze the effects of backdoor attacks on federated meta-learning, where users train a model that can be adapted to different sets of output classes using only a few examples. While the ability to adapt could, in principle, make federated learning frameworks more robust to backdoor attacks (when new training examples are benign), we find that even 1-shot~attacks can be very successful and persist after additional training. To address these vulnerabilities, we propose a defense mechanism inspired by matching networks, where the class of an input is predicted from the similarity of its features with a support set of labeled examples. By removing the decision logic from the model shared with the federation, success and persistence of backdoor attacks are greatly reduced.
DCNov 12, 2019
Throughput Prediction of Asynchronous SGD in TensorFlowZhuojin Li, Wumo Yan, Marco Paolieri et al.
Modern machine learning frameworks can train neural networks using multiple nodes in parallel, each computing parameter updates with stochastic gradient descent (SGD) and sharing them asynchronously through a central parameter server. Due to communication overhead and bottlenecks, the total throughput of SGD updates in a cluster scales sublinearly, saturating as the number of nodes increases. In this paper, we present a solution to predicting training throughput from profiling traces collected from a single-node configuration. Our approach is able to model the interaction of multiple nodes and the scheduling of concurrent transmissions between the parameter server and each node. By accounting for the dependencies between received parts and pending computations, we predict overlaps between computation and communication and generate synthetic execution traces for configurations with multiple nodes. We validate our approach on TensorFlow training jobs for popular image classification neural networks, on AWS and on our in-house cluster, using nodes equipped with GPUs or only with CPUs. We also investigate the effects of data transmission policies used in TensorFlow and the accuracy of our approach when combined with optimizations of the transmission schedule.