DCApr 28, 2022
RISCLESS: A Reinforcement Learning Strategy to Exploit Unused Cloud ResourcesSidahmed Yalles, Mohamed Handaoui, Jean-Emile Dartois et al.
One of the main objectives of Cloud Providers (CP) is to guarantee the Service-Level Agreement (SLA) of customers while reducing operating costs. To achieve this goal, CPs have built large-scale datacenters. This leads, however, to underutilized resources and an increase in costs. A way to improve the utilization of resources is to reclaim the unused parts and resell them at a lower price. Providing SLA guarantees to customers on reclaimed resources is a challenge due to their high volatility. Some state-of-the-art solutions consider keeping a proportion of resources free to absorb sudden variation in workloads. Others consider stable resources on top of the volatile ones to fill in for the lost resources. However, these strategies either reduce the amount of reclaimable resources or operate on less volatile ones such as Amazon Spot instance. In this paper, we proposed RISCLESS, a Reinforcement Learning strategy to exploit unused Cloud resources. Our approach consists of using a small proportion of stable on-demand resources alongside the ephemeral ones in order to guarantee customers SLA and reduce the overall costs. The approach decides when and how much stable resources to allocate in order to fulfill customers' demands. RISCLESS improved the CPs' profits by an average of 15.9% compared to state-of-the-art strategies. It also reduced the SLA violation time by an average of 36.7% while increasing the amount of used ephemeral resources by 19.5% on average
LGFeb 4
RETENTION: Resource-Efficient Tree-Based Ensemble Model Acceleration with Content-Addressable MemoryYi-Chun Liao, Chieh-Lin Tsai, Yuan-Hao Chang et al.
Although deep learning has demonstrated remarkable capability in learning from unstructured data, modern tree-based ensemble models remain superior in extracting relevant information and learning from structured datasets. While several efforts have been made to accelerate tree-based models, the inherent characteristics of the models pose significant challenges for conventional accelerators. Recent research leveraging content-addressable memory (CAM) offers a promising solution for accelerating tree-based models, yet existing designs suffer from excessive memory consumption and low utilization. This work addresses these challenges by introducing RETENTION, an end-to-end framework that significantly reduces CAM capacity requirement for tree-based model inference. We propose an iterative pruning algorithm with a novel pruning criterion tailored for bagging-based models (e.g., Random Forest), which minimizes model complexity while ensuring controlled accuracy degradation. Additionally, we present a tree mapping scheme that incorporates two innovative data placement strategies to alleviate the memory redundancy caused by the widespread use of don't care states in CAM. Experimental results show that implementing the tree mapping scheme alone reduces CAM capacity requirement by $1.46\times$ to $21.30 \times$, while the full RETENTION framework achieves $4.35\times$ to $207.12\times$ reduction with less than 3\% accuracy loss. These results demonstrate that RETENTION is highly effective in minimizing CAM resource demand, providing a resource-efficient direction for tree-based model acceleration.
34.4LGMar 24
APreQEL: Adaptive Mixed Precision Quantization For Edge LLMsMeriem Bouzouad, Yuan-Hao Chang, Jalil Boukhobza
Today, large language models have demonstrated their strengths in various tasks ranging from reasoning, code generation, and complex problem solving. However, this advancement comes with a high computational cost and memory requirements, making it challenging to deploy these models on edge devices to ensure real-time responses and data privacy. Quantization is one common approach to reducing memory use, but most methods apply it uniformly across all layers. This does not account for the fact that different layers may respond differently to reduced precision. Importantly, memory consumption and computational throughput are not necessarily aligned, further complicating deployment decisions. This paper proposes an adaptive mixed precision quantization mechanism that balances memory, latency, and accuracy in edge deployment under user-defined priorities. This is achieved by analyzing the layer-wise contribution and by inferring how different quantization types behave across the target hardware platform in order to assign the most suitable quantization type to each layer. This integration ensures that layer importance and the overall performance trade-offs are jointly respected in this design. Our work unlocks new configuration designs that uniform quantization cannot achieve, expanding the solution space to efficiently deploy the LLMs on resource-constrained devices.
PFSep 23, 2020
ReLeaSER: A Reinforcement Learning Strategy for Optimizing Utilization Of Ephemeral Cloud ResourcesMohamed Handaoui, Jean-Emile Dartois, Jalil Boukhobza et al.
Cloud data center capacities are over-provisioned to handle demand peaks and hardware failures which leads to low resources' utilization. One way to improve resource utilization and thus reduce the total cost of ownership is to offer unused resources (referred to as ephemeral resources) at a lower price. However, reselling resources needs to meet the expectations of its customers in terms of Quality of Service. The goal is so to maximize the amount of reclaimed resources while avoiding SLA penalties. To achieve that, cloud providers have to estimate their future utilization to provide availability guarantees. The prediction should consider a safety margin for resources to react to unpredictable workloads. The challenge is to find the safety margin that provides the best trade-off between the amount of resources to reclaim and the risk of SLA violations. Most state-of-the-art solutions consider a fixed safety margin for all types of metrics (e.g., CPU, RAM). However, a unique fixed margin does not consider various workloads variations over time which may lead to SLA violations or/and poor utilization. In order to tackle these challenges, we propose ReLeaSER, a Reinforcement Learning strategy for optimizing the ephemeral resources' utilization in the cloud. ReLeaSER dynamically tunes the safety margin at the host-level for each resource metric. The strategy learns from past prediction errors (that caused SLA violations). Our solution reduces significantly the SLA violation penalties on average by 2.7x and up to 3.4x. It also improves considerably the CPs' potential savings by 27.6% on average and up to 43.6%.
ARSep 10, 2013
Evaluation of the Performance/Energy Overhead in DSP Video Decoding and its ImplicationsYahia Benmoussa, Jalil Boukhobza, Eric Senn et al.
Video decoding is considered as one of the most compute and energy intensive application in energy constrained mobile devices. Some specific processing units, such as DSPs, are added to those devices in order to optimize the performance and the energy consumption. However, in DSP video decoding, the inter-processor communication overhead may have a considerable impact on the performance and the energy consumption. In this paper, we propose to evaluate this overhead and analyse its impact on the performance and the energy consumption as compared to the GPP decoding. Our work revealed that the GPP can be the best choice in many cases due to the a significant overhead in DSP decoding which may represents 30% of the total decoding energy.
MMAug 31, 2012
Behavioral Systel Level Power Consumption Modeling of Mobile Video Streaming applicationsYahia Benmoussa, Jalil Boukhobza, Yassine Hadjadj Aoul et al.
Nowadays, the use of mobile applications and terminals faces fundamental challenges related to energy constraint. This is due to the limited battery lifetime as compared to the increasing hardware evolution. Video streaming is one of the most energy consuming applications in a mobile system because of its intensive use of bandwidth, memory and processing power. In this work, we aim to propose a methodology for building and validating a high level global power consumption model including a hardware and software elements. Our approach is based on exploiting the interactions between power consumption sub-models of standalone systems in the perspective to build more accurate global model. The interactions are studied within the exclusive context of video streaming applications that are one of the most used mobile applications.