LGMay 13, 2022
Federated Learning Under Intermittent Client Availability and Time-Varying Communication ConstraintsMonica Ribero, Haris Vikalo, Gustavo De Veciana
Federated learning systems facilitate training of global models in settings where potentially heterogeneous data is distributed across a large number of clients. Such systems operate in settings with intermittent client availability and/or time-varying communication constraints. As a result, the global models trained by federated learning systems may be biased towards clients with higher availability. We propose F3AST, an unbiased algorithm that dynamically learns an availability-dependent client selection strategy which asymptotically minimizes the impact of client-sampling variance on the global model convergence, enhancing performance of federated learning. The proposed algorithm is tested in a variety of settings for intermittently available clients under communication constraints, and its efficacy demonstrated on synthetic data and realistically federated benchmarking experiments using CIFAR100 and Shakespeare datasets. We show up to 186% and 8% accuracy improvements over FedAvg, and 8% and 7% over FedAdam on CIFAR100 and Shakespeare, respectively.
SYApr 13, 2012
Variability Aware Network Utility MaximizationVinay Joseph, Gustavo de Veciana
Network Utility Maximization (NUM) provides the key conceptual framework to study resource allocation amongst a collection of users/entities across disciplines as diverse as economics, law and engineering. In network engineering, this framework has been particularly insightful towards understanding how Internet protocols allocate bandwidth, and motivated diverse research on distributed mechanisms to maximize network utility while incorporating new relevant constraints, on energy/power, storage, stability, etc., for systems ranging from communication networks to the smart-grid. However when the available resources and/or users' utilities vary over time, a user's allocations will tend to vary, which in turn may have a detrimental impact on the users' utility or quality of experience. This paper introduces a generalized NUM framework which explicitly incorporates the detrimental impact of temporal variability in a user's allocated rewards. It explicitly incorporates tradeoffs amongst the mean and variability in users' allocations. We propose an online algorithm to realize variance-sensitive NUM, which, under stationary ergodic assumptions, is shown to be asymptotically optimal, i.e., achieves a time-average equal to that of an offline algorithm with knowledge of the future variability in the system. This substantially extends work on NUM to an interesting class of relevant problems where users/entities are sensitive to temporal variability in their service or allocated rewards.
LGJan 11, 2023
Network Adaptive Federated Learning: Congestion and Lossy CompressionParikshit Hegde, Gustavo de Veciana, Aryan Mokhtari
In order to achieve the dual goals of privacy and learning across distributed data, Federated Learning (FL) systems rely on frequent exchanges of large files (model updates) between a set of clients and the server. As such FL systems are exposed to, or indeed the cause of, congestion across a wide set of network resources. Lossy compression can be used to reduce the size of exchanged files and associated delays, at the cost of adding noise to model updates. By judiciously adapting clients' compression to varying network congestion, an FL application can reduce wall clock training time. To that end, we propose a Network Adaptive Compression (NAC-FL) policy, which dynamically varies the client's lossy compression choices to network congestion variations. We prove, under appropriate assumptions, that NAC-FL is asymptotically optimal in terms of directly minimizing the expected wall clock training time. Further, we show via simulation that NAC-FL achieves robust performance improvements with higher gains in settings with positively correlated delays across time.
SYOct 9, 2012
Resource Allocation: Realizing Mean-Variability-Fairness TradeoffsVinay Joseph, Gustavo de Veciana, Ari Arapostathis
Network Utility Maximization (NUM) provides a key conceptual framework to study reward allocation amongst a collection of users/entities across disciplines as diverse as economics, law and engineering. In network engineering, this framework has been particularly insightful towards understanding how Internet protocols allocate bandwidth, and motivated diverse research efforts on distributed mechanisms to maximize network utility while incorporating new relevant constraints, on energy, power, storage, stability, etc., e.g., for systems ranging from communication networks to the smart-grid. However when the available resources and/or users' utilities vary over time, reward allocations will tend to vary, which in turn may have a detrimental impact on the users' overall satisfaction or quality of experience. This paper introduces a generalization of NUM framework which explicitly incorporates the detrimental impact of temporal variability in a user's allocated rewards. It explicitly incorporates tradeoffs amongst the mean and variability in users' reward allocations, as well as fairness. We propose a simple online algorithm to realize these tradeoffs, which, under stationary ergodic assumptions, is shown to be asymptotically optimal, i.e., achieves a long term performance equal to that of an offline algorithm with knowledge of the future variability in the system. This substantially extends work on NUM to an interesting class of relevant problems where users/entities are sensitive to temporal variability in their service or allocated rewards.
LGMar 4
Online Learning for Multi-Layer Hierarchical Inference under Partial and Policy-Dependent FeedbackHaoran Zhang, Seohyeon Cha, Hasan Burhan Beytur et al.
Hierarchical inference systems route tasks across multiple computational layers, where each node may either finalize a prediction locally or offload the task to a node in the next layer for further processing. Learning optimal routing policies in such systems is challenging: inference loss is defined recursively across layers, while feedback on prediction error is revealed only at a terminal oracle layer. This induces a partial, policy-dependent feedback structure in which observability probabilities decay with depth, causing importance-weighted estimators to suffer from amplified variance. We study online routing for multi-layer hierarchical inference under long-term resource constraints and terminal-only feedback. We formalize the recursive loss structure and show that naive importance-weighted contextual bandit methods become unstable as feedback probability decays along the hierarchy. To address this, we develop a variance-reduced EXP4-based algorithm integrated with Lyapunov optimization, yielding unbiased loss estimation and stable learning under sparse and policy-dependent feedback. We provide regret guarantees relative to the best fixed routing policy in hindsight and establish near-optimality under stochastic arrivals and resource constraints. Experiments on large-scale multi-task workloads demonstrate improved stability and performance compared to standard importance-weighted approaches.
LGFeb 5
Regularized Calibration with Successive Rounding for Post-Training QuantizationSeohyeon Cha, Huancheng Chen, Dongjun Kim et al.
Large language models (LLMs) deliver robust performance across diverse applications, yet their deployment often faces challenges due to the memory and latency costs of storing and accessing billions of parameters. Post-training quantization (PTQ) enables efficient inference by mapping pretrained weights to low-bit formats without retraining, but its effectiveness depends critically on both the quantization objective and the rounding procedure used to obtain low-bit weight representations. In this work, we show that interpolating between symmetric and asymmetric calibration acts as a form of regularization that preserves the standard quadratic structure used in PTQ while providing robustness to activation mismatch. Building on this perspective, we derive a simple successive rounding procedure that naturally incorporates asymmetric calibration, as well as a bounded-search extension that allows for an explicit trade-off between quantization quality and the compute cost. Experiments across multiple LLM families, quantization bit-widths, and benchmarks demonstrate that the proposed bounded search based on a regularized asymmetric calibration objective consistently improves perplexity and accuracy over PTQ baselines, while incurring only modest and controllable additional computational cost.
ITFeb 7, 2025
Generative Diffusion Model-based Compression of MIMO CSIHeasung Kim, Taekyun Lee, Hyeji Kim et al.
While neural lossy compression techniques have markedly advanced the efficiency of Channel State Information (CSI) compression and reconstruction for feedback in MIMO communications, efficient algorithms for more challenging and practical tasks-such as CSI compression for future channel prediction and reconstruction with relevant side information-remain underexplored, often resulting in suboptimal performance when existing methods are extended to these scenarios. To that end, we propose a novel framework for compression with side information, featuring an encoding process with fixed-rate compression using a trainable codebook for codeword quantization, and a decoding procedure modeled as a backward diffusion process conditioned on both the codeword and the side information. Experimental results show that our method significantly outperforms existing CSI compression algorithms, often yielding over twofold performance improvement by achieving comparable distortion at less than half the data rate of competing methods in certain scenarios. These findings underscore the potential of diffusion-based compression for practical deployment in communication systems.
LGDec 14, 2025
Optimal Resource Allocation for ML Model Training and Deployment under Concept DriftHasan Burhan Beytur, Gustavo de Veciana, Haris Vikalo et al.
We study how to allocate resources for training and deployment of machine learning (ML) models under concept drift and limited budgets. We consider a setting in which a model provider distributes trained models to multiple clients whose devices support local inference but lack the ability to retrain those models, placing the burden of performance maintenance on the provider. We introduce a model-agnostic framework that captures the interaction between resource allocation, concept drift dynamics, and deployment timing. We show that optimal training policies depend critically on the aging properties of concept durations. Under sudden concept changes, we derive optimal training policies subject to budget constraints when concept durations follow distributions with Decreasing Mean Residual Life (DMRL), and show that intuitive heuristics are provably suboptimal under Increasing Mean Residual Life (IMRL). We further study model deployment under communication constraints, prove that the associated optimization problem is quasi-convex under mild conditions, and propose a randomized scheduling strategy that achieves near-optimal client-side performance. These results offer theoretical and algorithmic foundations for cost-efficient ML model management under concept drift, with implications for continual learning, distributed inference, and adaptive ML systems.
LGAug 18, 2025
Batching-Aware Joint Model Onloading and Offloading for Hierarchical Multi-Task InferenceSeohyeon Cha, Kevin Chan, Gustavo de Veciana et al.
The growing demand for intelligent services on resource-constrained edge devices has spurred the development of collaborative inference systems that distribute workloads across end devices, edge servers, and the cloud. While most existing frameworks focus on single-task, single-model scenarios, many real-world applications (e.g., autonomous driving and augmented reality) require concurrent execution of diverse tasks including detection, segmentation, and depth estimation. In this work, we propose a unified framework to jointly decide which multi-task models to deploy (onload) at clients and edge servers, and how to route queries across the hierarchy (offload) to maximize overall inference accuracy under memory, compute, and communication constraints. We formulate this as a mixed-integer program and introduce J3O (Joint Optimization of Onloading and Offloading), an alternating algorithm that (i) greedily selects models to onload via Lagrangian-relaxed submodular optimization and (ii) determines optimal offloading via constrained linear programming. We further extend J3O to account for batching at the edge, maintaining scalability under heterogeneous task loads. Experiments show J3O consistently achieves over $97\%$ of the optimal accuracy while incurring less than $15\%$ of the runtime required by the optimal solver across multi-task benchmarks.
LGAug 1, 2025
Optimal Scheduling Algorithms for LLM Inference: Theory and PracticeAgrim Bari, Parikshit Hegde, Gustavo de Veciana
With the growing use of Large Language Model (LLM)-based tools like ChatGPT, Perplexity, and Gemini across industries, there is a rising need for efficient LLM inference systems. These systems handle requests with a unique two-phase computation structure: a prefill-phase that processes the full input prompt and a decode-phase that autoregressively generates tokens one at a time. This structure calls for new strategies for routing and scheduling requests. In this paper, we take a comprehensive approach to this challenge by developing a theoretical framework that models routing and scheduling in LLM inference systems. We identify two key design principles-optimal tiling and dynamic resource allocation-that are essential for achieving high throughput. Guided by these principles, we propose the Resource-Aware Dynamic (RAD) scheduler and prove that it achieves throughput optimality under mild conditions. To address practical Service Level Objectives (SLOs) such as serving requests with different Time Between Token (TBT) constraints, we design the SLO-Aware LLM Inference (SLAI) scheduler. SLAI uses real-time measurements to prioritize decode requests that are close to missing their TBT deadlines and reorders prefill requests based on known prompt lengths to further reduce the Time To First Token (TTFT) delays. We evaluate SLAI on the Openchat ShareGPT4 dataset using the Mistral-7B model on an NVIDIA RTX ADA 6000 GPU. Compared to Sarathi-Serve, SLAI reduces the median TTFT by 53% and increases the maximum serving capacity by 26% such that median TTFT is below 0.5 seconds, while meeting tail TBT latency constraints.
LGFeb 7, 2025
Importance Sampling via Score-based Generative ModelsHeasung Kim, Taekyun Lee, Hyeji Kim et al.
Importance sampling, which involves sampling from a probability density function (PDF) proportional to the product of an importance weight function and a base PDF, is a powerful technique with applications in variance reduction, biased or customized sampling, data augmentation, and beyond. Inspired by the growing availability of score-based generative models (SGMs), we propose an entirely training-free Importance sampling framework that relies solely on an SGM for the base PDF. Our key innovation is realizing the importance sampling process as a backward diffusion process, expressed in terms of the score function of the base PDF and the specified importance weight function--both readily available--eliminating the need for any additional training. We conduct a thorough analysis demonstrating the method's scalability and effectiveness across diverse datasets and tasks, including importance sampling for industrial and natural images with neural importance weight functions. The training-free aspect of our method is particularly compelling in real-world scenarios where a single base distribution underlies multiple biased sampling tasks, each requiring a different importance weight function. To the best of our knowledge our approach is the first importance sampling framework to achieve this.
DMJul 22, 2019
Performance-Complexity Tradeoffs in Greedy Weak Submodular Maximization with Random SamplingAbolfazl Hashemi, Haris Vikalo, Gustavo de Veciana
Many problems in signal processing and machine learning can be formalized as weak submodular optimization tasks. For such problems, a simple greedy algorithm (\textsc{Greedy}) is guaranteed to find a solution achieving the objective with a value no worse than $1-e^{-1/c}$ of the optimal, where $c$ is the multiplicative weak-submodularity constant. Due to the high cost of querying large-scale systems, the complexity of \textsc{Greedy} becomes prohibitive in contemporary applications. In this work, we study the tradeoff between performance and complexity when one resorts to random sampling strategies to reduce the query complexity of \textsc{Greedy}. Specifically, we quantify the effect of uniform sampling strategies on \textsc{Greedy}'s performance through two metrics: (i) probability of identifying an optimal subset, and (ii) suboptimality with respect to the optimal solution. The latter implies that uniform sampling strategies with a fixed sampling size achieve a non-trivial approximation factor; however, we show that with overwhelming probability, these methods fail to find the optimal subset. Our analysis shows that the failure of uniform sampling strategies with fixed sample size can be circumvented by successively increasing the size of the search space. Building upon this insight, we propose a simple progressive stochastic greedy algorithm and study its approximation guarantees. Moreover, we demonstrate effectiveness of the proposed method in dimensionality reduction applications and feature selection tasks for clustering and object tracking.
LGMar 17, 2019
Modeling and Optimization of Human-machine Interaction Processes via the Maximum Entropy PrincipleJiaxiao Zheng, Gustavo de Veciana
We propose a data-driven framework to enable the modeling and optimization of human-machine interaction processes, e.g., systems aimed at assisting humans in decision-making or learning, work-load allocation, and interactive advertising. This is a challenging problem for several reasons. First, humans' behavior is hard to model or infer, as it may reflect biases, long term memory, and sensitivity to sequencing, i.e., transience and exponential complexity in the length of the interaction. Second, due to the interactive nature of such processes, the machine policy used to engage with a human may bias possible data-driven inferences. Finally, in choosing machine policies that optimize interaction rewards, one must, on the one hand, avoid being overly sensitive to error/variability in the estimated human model, and on the other, being overly deterministic/predictable which may result in poor human 'engagement' in the interaction. To meet these challenges, we propose a robust approach, based on the maximum entropy principle, which iteratively estimates human behavior and optimizes the machine policy--Alternating Entropy-Reward Ascent (AREA) algorithm. We characterize AREA, in terms of its space and time complexity and convergence. We also provide an initial validation based on synthetic data generated by an established noisy nonlinear model for human decision-making.
MMNov 25, 2013
Rate Adaptation and Admission Control for Video Transmission with Subjective Quality ConstraintsChao Chen, Xiaoqing Zhu, Gustavo de Veciana et al.
Adapting video data rate during streaming can effectively reduce the risk of playback interruptions caused by channel throughput fluctuations. The variations in rate, however, also introduce video quality fluctuations and thus potentially affects viewers' Quality of Experience (QoE). We show how the QoE of video users can be improved by rate adaptation and admission control. We conducted a subjective study wherein we found that viewers' QoE was strongly correlated with the empirical cumulative distribution function (eCDF) of the predicted video quality. Based on this observation, we propose a rate-adaptation algorithm that can incorporate QoE constraints on the empirical cumulative quality distribution per user. We then propose a threshold-based admission control policy to block users whose empirical cumulative quality distribution is not likely to satisfy their QoE constraint. We further devise an online adaptation algorithm to automatically optimize the threshold. Extensive simulation results show that the proposed scheme can reduce network resource consumption by $40\%$ over conventional average-quality maximized rate-adaptation algorithms.
MMNov 25, 2013
Modeling the Time-varying Subjective Quality of HTTP Video Streams with Rate AdaptationsChao Chen, Lark Kwon Choi, Gustavo de Veciana et al.
Newly developed HTTP-based video streaming technologies enable flexible rate-adaptation under varying channel conditions. Accurately predicting the users' Quality of Experience (QoE) for rate-adaptive HTTP video streams is thus critical to achieve efficiency. An important aspect of understanding and modeling QoE is predicting the up-to-the-moment subjective quality of a video as it is played, which is difficult due to hysteresis effects and nonlinearities in human behavioral responses. This paper presents a Hammerstein-Wiener model for predicting the time-varying subjective quality (TVSQ) of rate-adaptive videos. To collect data for model parameterization and validation, a database of longer-duration videos with time-varying distortions was built and the TVSQs of the videos were measured in a large-scale subjective study. The proposed method is able to reliably predict the TVSQ of rate adaptive videos. Since the Hammerstein-Wiener model has a very simple structure, the proposed method is suitable for on-line TVSQ prediction in HTTP based streaming.
NIJul 27, 2013
NOVA: QoE-driven Optimization of DASH-based Video Delivery in NetworksVinay Joseph, Gustavo de Veciana
We consider the problem of optimizing video delivery for a network supporting video clients streaming stored video. Specifically, we consider the problem of jointly optimizing network resource allocation and video quality adaptation. Our objective is to fairly maximize video clients' Quality of Experience (QoE) realizing tradeoffs among the mean quality, temporal variability in quality, and fairness, incorporating user preferences on rebuffering and cost of video delivery. We present a simple asymptotically optimal online algorithm, NOVA, to solve the problem. NOVA is asynchronous, and using minimal communication, distributes the tasks of resource allocation to network controller, and quality adaptation to respective video clients. Video quality adaptation in NOVA is also optimal for standalone video clients, and is well suited for use with DASH framework. Further, we extend NOVA for use with more general QoE models, networks shared with other traffic loads and networks using fixed/legacy resource allocation.
MMSep 10, 2012
A Markov Decision Model for Adaptive Scheduling of Stored Scalable VideosChao Chen, Robert W. Heath, Alan C. Bovik et al.
We propose two scheduling algorithms that seek to optimize the quality of scalably coded videos that have been stored at a video server before transmission.} The first scheduling algorithm is derived from a Markov Decision Process (MDP) formulation developed here. We model the dynamics of the channel as a Markov chain and reduce the problem of dynamic video scheduling to a tractable Markov decision problem over a finite state space. Based on the MDP formulation, a near-optimal scheduling policy is computed that minimize the mean square error. Using insights taken from the development of the optimal MDP-based scheduling policy, the second proposed scheduling algorithm is an online scheduling method that only requires easily measurable knowledge of the channel dynamics, and is thus viable in practice. Simulation results show that the performance of both scheduling algorithms is close to a performance upper bound also derived in this paper.