DC Feb 18
DistributedEstimator: Distributed Training of Quantum Neural Networks via Circuit Cutting
Prabhjot Singh, Adel N. Toosi, Rajkumar Buyya
Circuit cutting decomposes a large quantum circuit into a collection of smaller subcircuits. The outputs of these subcircuits are then classically reconstructed to recover the original expectation values. While prior work characterises cutting overhead largely in terms of subcircuit counts and sampling complexity, its end-to-end impact on iterative, estimator-driven training pipelines remains insufficiently measured from a systems perspective. In this paper, we propose a cut-aware estimator execution pipeline that treats circuit cutting as a staged distributed workload and instruments each estimator query into partitioning, subexperiment generation, parallel execution, and classical reconstruction phases. Using logged runtime traces and learning outcomes on two binary classification workloads (Iris and MNIST), we quantify cutting overheads, scaling limits, and sensitivity to injected stragglers, and we evaluate whether accuracy and robustness are preserved under matched training budgets. Our measurements show that cutting introduces substantial end-to-end overheads that grow with the number of cuts, and that reconstruction constitutes a dominant fraction of per-query time, bounding achievable speed-up under increased parallelism. Despite these systems costs, test accuracy and robustness are preserved in the measured regimes, with configuration-dependent improvements observed in some cut settings. These results indicate that practical scaling of circuit cutting for learning workloads hinges on reducing and overlapping reconstruction and on scheduling policies that account for barrier-dominated critical paths.
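The claim that reconstruction bounds the achievable speed-up can be illustrated with an Amdahl-style calculation. This is a minimal sketch, not the paper's model: it assumes reconstruction is not parallelised at all and that every other phase scales perfectly with the number of workers. The function name `max_speedup` and the 60% reconstruction share are hypothetical values chosen for illustration.

```python
def max_speedup(serial_fraction: float, workers: int) -> float:
    """Amdahl-style bound on per-query speed-up.

    serial_fraction: share of per-query time that does not parallelise
                     (here: classical reconstruction, by assumption).
    workers:         number of parallel workers for the remaining phases.
    """
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be in [0, 1]")
    if workers < 1:
        raise ValueError("workers must be >= 1")
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


if __name__ == "__main__":
    # Hypothetical share: reconstruction takes 60% of per-query time.
    for n in (1, 2, 4, 8, 64):
        print(n, round(max_speedup(0.6, n), 3))
```

Under these assumptions the speed-up can never exceed `1 / serial_fraction`, however many workers are added, which is why reducing or overlapping reconstruction matters more than adding parallelism.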
DC Apr 28
Janus: Disaggregating Attention and Experts for Scalable MoE Inference
Zhexiang Zhang, Ye Wang, Yumiao Zhao et al.
Serving large Mixture-of-Experts (MoE) models is challenging because of their large memory footprints, heterogeneous resource demands, and highly dynamic inference workloads. Most existing MoE inference systems deploy the entire model as a monolithic unit, forcing attention and MoE layers to share the same resource configuration despite their different scaling behaviors and resource bottlenecks. Such coarse-grained provisioning leads to resource inefficiency and suboptimal performance. We present JANUS, a scalable and resource-efficient MoE inference system built around three key principles. First, JANUS disaggregates attention and MoE layers onto separate GPU worker pools, enabling independent resource provisioning for the two layer types, and uses an adaptive two-phase communication mechanism for low-latency data exchange. Second, because MoE-layer execution is often memory-bound and highly sensitive to activated-expert imbalance, JANUS introduces a lightweight, microsecond-scale activation scheduler that balances per-layer activated experts across MoE instances to reduce inference latency. Third, JANUS employs a fine-grained, SLO-aware resource scaling scheme that jointly selects attention resources, MoE resources, and expert placement to minimize GPU cost under token-level SLOs. Evaluation shows that JANUS improves per-GPU throughput by up to 4.7x over state-of-the-art MoE inference baselines while satisfying token-level latency SLOs.
CL Jan 7
IntroLM: Introspective Language Models via Prefilling-Time Self-Evaluation
Hossein Hosseini Kasnavieh, Gholamreza Haffari, Chris Leckie et al.
A major challenge in operating large language models (LLMs) is predicting whether a specific LLM will produce sufficiently high-quality output for a given query. Existing approaches rely on external classifiers, most commonly BERT-based models, which suffer from limited context windows, constrained representational capacity, and additional computational overhead. We propose IntroLM, a method that uses introspective tokens to enable causal language models to predict their own output quality during the prefilling phase without affecting generation. By introducing token-conditional LoRA that activates only for the introspective token, the model learns to predict the output quality for a given query while preserving the original backbone behavior and avoiding external evaluators. On question answering benchmarks, IntroLM applied to Qwen3 8B achieves a ROC AUC of 90 percent for success prediction, outperforming a DeBERTa classifier by 14 percent. When integrated into multi-model routing systems, IntroLM achieves superior cost-performance tradeoffs, reducing latency by up to 33 percent and large-model usage by up to 50 percent at matched reliability.
CV Sep 25, 2024
Benchmarking Deep Learning Models for Object Detection on Edge Computing Devices
Daghash K. Alqahtani, Aamir Cheema, Adel N. Toosi
Modern applications, such as autonomous vehicles, require deploying deep learning algorithms on resource-constrained edge devices for real-time image and video processing. However, there is limited understanding of the efficiency and performance of various object detection models on these devices. In this paper, we evaluate state-of-the-art object detection models, including YOLOv8 (Nano, Small, Medium), EfficientDet Lite (Lite0, Lite1, Lite2), and SSD (SSD MobileNet V1, SSDLite MobileDet). We deployed these models on popular edge devices, namely the Raspberry Pi 3, 4, and 5 with and without TPU accelerators, and the Jetson Orin Nano, collecting key performance metrics such as energy consumption, inference time, and Mean Average Precision (mAP). Our findings highlight that lower-mAP models such as SSD MobileNet V1 are more energy-efficient and faster in inference, whereas higher-mAP models like YOLOv8 Medium generally consume more energy and have slower inference, though with exceptions when accelerators like TPUs are used. Among the edge devices, Jetson Orin Nano stands out as the fastest and most energy-efficient option for request handling, despite having the highest idle energy consumption. These results emphasize the need to balance accuracy, speed, and energy efficiency when deploying deep learning models on edge devices, offering valuable guidance for practitioners and researchers selecting models and devices for their applications.
DC Apr 14
PipeLive: Efficient Live In-place Pipeline Parallelism Reconfiguration for Dynamic LLM Serving
Xu Bai, Muhammed Tawfiqul Islam, Chen Wang et al.
Pipeline parallelism (PP) is widely used to partition layers of large language models (LLMs) across GPUs, enabling scalable inference for large models. However, existing systems rely on static PP configurations that fail to adapt to dynamic settings, such as serverless platforms and heterogeneous GPU environments. Reconfiguring PP by stopping and redeploying the service incurs prohibitive downtime, so reconfiguration must instead proceed live and in place, without interrupting inference. However, live in-place PP reconfiguration is fundamentally challenging. GPUs are already saturated with model weights and KV cache, leaving little room for new layer placements and necessitating KV cache resizing, at odds with systems like vLLM that preallocate for throughput. Moreover, maintaining KV consistency during execution is difficult: stop-and-copy introduces large pauses, while background synchronization risks inconsistency as states evolve. We present PipeLive, which enables live in-place PP reconfiguration with minimal disruption. PipeLive introduces a redesigned KV cache layout together with a co-designed extension to PagedAttention, forming a unified mechanism for live KV resizing. It further adopts an incremental KV patching mechanism, inspired by live virtual machine migration, to synchronize KV states between source and target configurations and identify a safe switch point. PipeLive achieves a 2.5x reduction in time-to-first-token (TTFT) without KV cache overflow compared to disabling KV resizing. Furthermore, compared to a variant without KV patching, it reduces reconfiguration overhead from seconds to under 10 ms, and improves TTFT and time-per-output-token (TPOT) by up to 54.7% and 14.7%, respectively.
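The idea of incremental patching borrowed from live virtual machine migration can be sketched as a pre-copy loop: copy everything in the background, then repeatedly re-copy only the blocks that changed during the previous round, and pause briefly once the remaining dirty set is small. The sketch below is a toy simulation under stated assumptions, not PipeLive's mechanism: KV blocks are modelled as a dictionary from block id to version number, and `dirty_per_round`, `threshold` and `max_rounds` are hypothetical parameters introduced for illustration.

```python
from typing import Dict, Iterable, Set, Tuple


def incremental_patch(
    source: Dict[int, int],
    target: Dict[int, int],
    dirty_per_round: Iterable[Set[int]],
    threshold: int = 2,
    max_rounds: int = 10,
) -> Tuple[int, int]:
    """Synchronise `target` with `source` using pre-copy rounds.

    source / target: block id -> version (a stand-in for KV block contents).
    dirty_per_round: blocks that ongoing inference modifies while each
                     background round is copying (simulated input).
    Returns (background rounds run, blocks copied during the final pause).
    """
    dirty = set(source)              # the first round copies every block
    updates = iter(dirty_per_round)
    rounds = 0
    while len(dirty) > threshold and rounds < max_rounds:
        # Background copy: take the current contents of all dirty blocks.
        snapshot = {block: source[block] for block in dirty}
        # Inference keeps running and modifies some blocks in the meantime.
        newly_dirty = set(next(updates, set()))
        for block in newly_dirty:
            source[block] = source.get(block, 0) + 1
        target.update(snapshot)
        dirty = newly_dirty          # only these blocks are stale now
        rounds += 1
    # Safe switch point: a short pause copies the few remaining stale blocks.
    for block in dirty:
        target[block] = source[block]
    return rounds, len(dirty)


if __name__ == "__main__":
    src = {i: 0 for i in range(8)}
    dst: Dict[int, int] = {}
    print(incremental_patch(src, dst, [{0, 1, 2, 3}, {2, 3}, {3}]))
    print(dst == src)
```

The design trade-off is the same as in pre-copy migration: a lower `threshold` shortens the final pause but may need more background rounds if blocks are modified quickly.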
NI Apr 13
RouterWise: Joint Resource Allocation and Routing for Latency-Aware Multi-Model LLM Serving
Hossein Hosseini Kasnavieh, Christopher Leckie, Adel N. Toosi
Multi-model LLM routing has emerged as an effective approach for reducing serving cost and latency while maintaining output quality by assigning each prompt to an appropriate model. However, prior routing methods typically assume that each model has a fixed latency. In real deployments, this assumption is inaccurate: multiple models often share limited GPU resources, and a model's latency depends strongly on both its allocated resources and the request load induced by the routing policy. Consequently, routing and resource allocation are tightly coupled. In this work, we study joint resource allocation and routing for latency-aware multi-model LLM serving in GPU clusters. Given a set of deployed models and a latency service-level objective (SLO), we seek a system setup and routing policy that maximize overall output quality while satisfying the latency target. We formalize this problem as a constrained joint optimization over deployment setup and routing fractions, and propose RouterWise, which combines a dual-price formulation for score-maximizing routing with setup-specific latency models derived from system profiling. RouterWise searches over feasible system setups and, for each fixed setup, computes the best routing policy under the latency target. Our results show that even on the same GPU cluster, the achievable output-quality score can vary by up to 87% across retained setups, highlighting that resource allocation is a key determinant of routing performance.
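A dual-price view of routing under a latency target can be sketched as follows: attach a price to latency, let every prompt pick the model with the best price-adjusted score, and raise the price until the mean latency meets the SLO. This is a deliberately simplified illustration, not RouterWise itself: it assumes one fixed latency per model for the given setup, whereas the abstract stresses that latency depends on allocated resources and induced load. All names (`route_with_price`, `solve`, `max_price`) and the bisection search are assumptions made for the example.

```python
from typing import List, Optional, Sequence, Tuple


def route_with_price(
    scores: Sequence[Sequence[float]], latencies: Sequence[float], price: float
) -> List[int]:
    """Each prompt picks the model maximising score - price * latency."""
    return [
        max(range(len(latencies)), key=lambda m: row[m] - price * latencies[m])
        for row in scores
    ]


def mean_latency(assignment: Sequence[int], latencies: Sequence[float]) -> float:
    return sum(latencies[m] for m in assignment) / len(assignment)


def solve(
    scores: Sequence[Sequence[float]],
    latencies: Sequence[float],
    slo: float,
    max_price: float = 1e3,
    iters: int = 50,
) -> Optional[Tuple[List[int], float]]:
    """Bisect on the latency price until the mean latency satisfies the SLO.

    Returns (assignment, price), or None if even max_price is infeasible.
    """
    assignment = route_with_price(scores, latencies, 0.0)
    if mean_latency(assignment, latencies) <= slo:
        return assignment, 0.0       # the SLO is not binding
    if mean_latency(route_with_price(scores, latencies, max_price), latencies) > slo:
        return None
    low, high = 0.0, max_price       # invariant: `high` is always feasible
    for _ in range(iters):
        mid = (low + high) / 2.0
        if mean_latency(route_with_price(scores, latencies, mid), latencies) <= slo:
            high = mid
        else:
            low = mid
    return route_with_price(scores, latencies, high), high


if __name__ == "__main__":
    # Hypothetical per-prompt quality scores for a large and a small model.
    scores = [[0.9, 0.5], [0.6, 0.55], [0.8, 0.3]]
    latencies = [2.0, 1.0]           # seconds per request, assumed fixed
    print(solve(scores, latencies, slo=1.7))
```

In this toy setting the price moves the prompts that lose the least quality to the faster model first, which is the intuition behind score-maximising routing under a latency constraint.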
DC Mar 16
Multi-Objective Load Balancing for Heterogeneous Edge-Based Object Detection Systems
Daghash K. Alqahtani, Maria A. Rodriguez, Muhammad Aamir Cheema et al.
The rapid proliferation of the Internet of Things (IoT) and smart applications has led to a surge in data generated by distributed sensing devices. Edge computing is a mainstream approach to managing this data by pushing computation closer to the data source, typically onto resource-constrained devices such as single-board computers (SBCs). In such environments, the unavoidable heterogeneity of hardware and software makes effective load balancing particularly challenging. In this paper, we propose a multi-objective load balancing method tailored to heterogeneous, edge-based object detection systems. We study a setting in which multiple device-model pairs expose distinct accuracy, latency, and energy profiles, while both request intensity and scene complexity fluctuate over time. To handle this dynamically varying environment, our approach uses a two-stage decision mechanism: it first performs accuracy-aware filtering to identify suitable device-model candidates that provide accuracy within the acceptable range, and then applies a weighted-sum scoring function over expected latency and energy consumption to select the final execution target. We evaluate the proposed load balancer through extensive experiments on real-world datasets, comparing against widely used baseline strategies. The results indicate that the proposed multi-objective load balancing method halves energy consumption and achieves an 80% reduction in end-to-end latency, while incurring only a modest, up to 10%, decrease in detection accuracy relative to an accuracy-centric baseline.
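The two-stage decision described above (accuracy-aware filtering, then weighted-sum scoring over latency and energy) can be sketched in a few lines. This is an illustrative sketch under assumptions, not the authors' implementation: the `Candidate` fields, the max-based normalisation, the default weights, and the device-model figures are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    name: str           # device-model pair, e.g. "orin+yolo-m" (hypothetical)
    accuracy: float     # expected detection accuracy in [0, 1]
    latency_ms: float   # expected end-to-end latency
    energy_j: float     # expected energy per request


def select_target(
    candidates: Sequence[Candidate],
    min_accuracy: float,
    w_latency: float = 0.5,
    w_energy: float = 0.5,
) -> Optional[Candidate]:
    # Stage 1: keep only pairs whose accuracy is within the acceptable range.
    eligible = [c for c in candidates if c.accuracy >= min_accuracy]
    if not eligible:
        return None
    # Normalise each objective by its maximum so the two units are comparable
    # (an assumption; any consistent normalisation would do).
    max_latency = max(c.latency_ms for c in eligible) or 1.0
    max_energy = max(c.energy_j for c in eligible) or 1.0

    def score(c: Candidate) -> float:
        return (w_latency * c.latency_ms / max_latency
                + w_energy * c.energy_j / max_energy)

    # Stage 2: the lowest weighted cost wins.
    return min(eligible, key=score)


if __name__ == "__main__":
    pool = [
        Candidate("pi4+ssd", accuracy=0.55, latency_ms=120.0, energy_j=0.8),
        Candidate("pi5+yolo-s", accuracy=0.70, latency_ms=200.0, energy_j=1.5),
        Candidate("orin+yolo-m", accuracy=0.78, latency_ms=60.0, energy_j=2.0),
    ]
    print(select_target(pool, min_accuracy=0.65))
    print(select_target(pool, min_accuracy=0.65, w_latency=0.0, w_energy=1.0))
```

Shifting the weights changes the selected target without touching the accuracy filter, which keeps the accuracy guarantee separate from the latency and energy trade-off.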
DC May 12
GraphFlash: Enabling Fast and Elastic Graph Processing on Serverless Infrastructure
Chen Zhao, Parsa Poorsistani, Mohammad Goudarzi et al.
Graph processing systems are essential for analyzing large-scale data with complex relationships, yet most existing frameworks rely on statically provisioned clusters, resulting in poor elasticity and inefficient resource utilization under dynamic workloads. Serverless computing offers automatic scaling and fine-grained billing, but existing serverless graph systems suffer from performance limitations due to inefficient state management and high communication overhead through external storage. We present GraphFlash, a fast and elastic graph processing framework built on serverless infrastructure. GraphFlash adopts a subgraph-centric programming model and leverages shared external storage for coordination and communication, enabling stateless, fine-grained function execution. It supports two execution modes: rotating mode for resource-constrained environments and pinned mode for higher performance when resources are sufficient. To address serverless limitations, GraphFlash introduces system-level optimizations, including partition-aware key aggregation, intra-function partition co-location, and superstep-aware activation. Across multiple graph algorithms and datasets, GraphFlash outperforms existing serverless-compatible systems by up to 127x in execution time and reduces resource consumption by up to 98% under higher-resource configurations, while matching the performance of traditional distributed frameworks on large workloads. Even with limited resources, it achieves up to 48x speedup and 99.97% cost reduction over prior serverless solutions, demonstrating that GraphFlash makes serverless graph processing practical and performant.
DC Aug 13, 2021
Digital Twin of a Cloud Data Centre: OpenStack Cluster Visualisation
Sheridan Gomes, Adel N. Toosi, Barrett Ens
Data centres are increasingly essential as the volume of data grows. They are facilities where computing systems are concentrated to support data processing, transfer and storage. Traditional data centres have largely moved towards the cloud model, making the processing, storage and harnessing of data more manageable and more accessible through utility and subscription-based computing services. From an administrative point of view, cloud data centres are complex systems that are hard to grasp, and analysing aspects such as maintenance and resource management takes a large amount of time. For a cloud data centre admin, this can be a challenging and highly time-consuming task. Accordingly, there is a need to improve the usability of cloud data centre monitoring and management tools, and a digital twin could fulfil this need. This paper's primary objective is to construct a digital twin - a 3D visualisation and monitoring tool - of a cloud data centre managed by OpenStack, the well-known open-source cloud computing infrastructure software. To evaluate our proposed tool, we gather feedback on the digital twin's usability compared to the OpenStack dashboard. The feedback comes from cloud data centre experts who test the digital twin and answer various questions in an interview. The study results show that our proposed digital twin will help data centre admins better monitor and manage their data centres. It will also facilitate further research on, and implementation of, digital twins of data centres to improve usability.
DC Jul 8, 2025
ECORE: Energy-Conscious Optimized Routing for Deep Learning Models at the Edge
Daghash K. Alqahtani, Maria A. Rodriguez, Muhammad Aamir Cheema et al.
Edge computing enables data processing closer to the source, significantly reducing latency, an essential requirement for real-time vision-based analytics such as object detection in surveillance and smart city environments. However, these tasks place substantial demands on resource-constrained edge devices, making the joint optimization of energy consumption and detection accuracy critical. To address this challenge, we propose ECORE, a framework that integrates multiple dynamic routing strategies, including a novel estimation-based technique and an innovative greedy selection algorithm, to direct image processing requests to the most suitable edge device-model pair. ECORE dynamically balances energy efficiency and detection performance based on object characteristics. We evaluate our framework through extensive experiments on real-world datasets, comparing against widely used baseline techniques. The evaluation leverages established object detection models (YOLO, SSD, EfficientDet) and diverse edge platforms, including Jetson Orin Nano, Raspberry Pi 4 and 5, and TPU accelerators. Results demonstrate that our proposed context-aware routing strategies can reduce energy consumption and latency by 35% and 49%, respectively, while incurring only a 2% loss in detection accuracy compared to accuracy-centric methods.
AI Jun 25, 2025
Smart Ride and Delivery Services with Electric Vehicles: Leveraging Bidirectional Charging for Profit Optimisation
Jinchun Du, Bojie Shen, Muhammad Aamir Cheema et al.
With the rising popularity of electric vehicles (EVs), modern service systems, such as ride-hailing and delivery services, are increasingly integrating EVs into their operations. Unlike conventional vehicles, EVs often have a shorter driving range, necessitating careful consideration of charging when fulfilling requests. With recent advances in Vehicle-to-Grid (V2G) technology - allowing EVs to also discharge energy back to the grid - new opportunities and complexities emerge. We introduce the Electric Vehicle Orienteering Problem with V2G (EVOP-V2G): a profit-maximization problem where EV drivers must select customer requests or orders while managing when and where to charge or discharge. This involves navigating dynamic electricity prices, charging station selection, and route constraints. We formulate the problem as a Mixed Integer Programming (MIP) model and propose two near-optimal metaheuristic algorithms: one evolutionary (EA) and the other based on large neighborhood search (LNS). Experiments on real-world data show our methods can double driver profits compared to baselines, while maintaining near-optimal performance on small instances and excellent scalability on larger ones. Our work highlights a promising path toward smarter, more profitable EV-based mobility systems that actively support the energy grid.
CV May 23, 2024
A motion-based compression algorithm for resource-constrained video camera traps
Malika Nisal Ratnayake, Lex Gallon, Adel N. Toosi et al.
Field-captured video facilitates detailed studies of spatio-temporal aspects of animal locomotion, decision-making and environmental interactions including predator-prey relationships and habitat utilisation. But even though data capture is cheap with mass-produced hardware, storage, processing and transmission overheads pose a hurdle to the acquisition of high-resolution video from field-situated edge computing devices. Efficient compression algorithms are therefore essential if monitoring is to be conducted on single-board computers in situations where such hurdles must be overcome. Animal motion tracking in the field has unique characteristics that necessitate the use of novel video compression techniques, which may be underexplored or unsuitable in other contexts. In this article, we therefore introduce a new motion analysis-based video compression algorithm specifically designed for camera traps. We implemented and tested this algorithm using a case study of insect-pollinator motion tracking on three popular edge computing platforms. The algorithm identifies and stores only image regions depicting motion relevant to pollination monitoring, reducing overall data size by an average of 87% across diverse test datasets. Our experiments demonstrate the algorithm's capability to preserve critical information for insect behaviour analysis through both manual observation and automatic analysis of the compressed footage. The method presented in this paper enhances the applicability of low-powered computer vision edge devices to remote, in situ animal motion monitoring, and improves the efficiency of playback during behavioural analyses. Our new software, EcoMotionZip, is available Open Access.
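Storing only the image regions that depict motion can be illustrated with plain frame differencing. The sketch below is a generic stand-in, not the EcoMotionZip algorithm: it treats frames as small greyscale grids (lists of rows), flags pixels whose intensity changes by more than a hypothetical `threshold`, and keeps only the bounding box around them.

```python
from typing import List, Optional, Tuple

Frame = List[List[int]]             # greyscale pixel rows, values 0-255
BBox = Tuple[int, int, int, int]    # (top, left, bottom, right), inclusive


def motion_bbox(prev: Frame, curr: Frame, threshold: int = 25) -> Optional[BBox]:
    """Bounding box of pixels that changed by more than `threshold`."""
    changed = [
        (r, c)
        for r, (prev_row, curr_row) in enumerate(zip(prev, curr))
        for c, (a, b) in enumerate(zip(prev_row, curr_row))
        if abs(a - b) > threshold
    ]
    if not changed:
        return None                 # no motion: nothing needs to be stored
    rows = [r for r, _ in changed]
    cols = [c for _, c in changed]
    return min(rows), min(cols), max(rows), max(cols)


def crop(frame: Frame, bbox: BBox) -> Frame:
    """Keep only the region inside the bounding box."""
    top, left, bottom, right = bbox
    return [row[left:right + 1] for row in frame[top:bottom + 1]]


if __name__ == "__main__":
    previous = [[0] * 5 for _ in range(4)]
    current = [row[:] for row in previous]
    current[1][2] = 200             # a small moving object (made-up values)
    current[2][3] = 200
    box = motion_bbox(previous, current)
    print(box, crop(current, box))
```

In this toy case only 4 of the 20 pixels are kept; a practical implementation would also need to filter out irrelevant motion such as wind-blown vegetation, which this sketch does not attempt.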
DB Jun 15, 2020
Comparing Alternative Route Planning Techniques: A Comparative User Study on Melbourne, Dhaka and Copenhagen Road Networks
Lingxiao Li, Muhammad Aamir Cheema, Hua Lu et al.
Many modern navigation systems and map-based services provide not only the fastest route from a source location s to a target location t but also a few alternative routes for users to choose from. Consequently, computing alternative paths has received significant research attention. However, it is unclear which of the existing approaches generates alternative routes of better quality because the quality of these alternatives is mostly subjective. Motivated by this, in this paper, we present a user study conducted on the road networks of Melbourne, Dhaka and Copenhagen that compares the quality (as perceived by the users) of the alternative routes generated by four of the most popular existing approaches, including the routes provided by Google Maps. We also present a web-based demo system that can be accessed using any internet-enabled device and allows users to see the alternative routes generated by the four approaches for any pair of selected source and target. We report the average ratings received by the four approaches, and our statistical analysis shows that there is no credible evidence that the four approaches receive different ratings on average. We also discuss the limitations of this user study and recommend that readers interpret these results with caution because certain factors may have affected the participants' ratings.