Hoang-Loc La

h-index: 2

3 papers

4 citations

3 Papers

8.7 DC Jul 9

Empirical Analysis of GPU Frequency Behavior Under ML Workloads

Truong-Thanh Le, Hoang-Loc La, Amir Taherkordi et al.

This work presents ongoing research on the frequency scaling behavior of NVIDIA GPUs when executing ML/AI workloads. Our preliminary findings show that, on lower-performance GPUs, the operating frequency is strongly affected by the recent workload history, typically within an 80 ms window. This behavior challenges a common assumption underlying several state-of-the-art ML latency-prediction techniques, which treat individual GPU kernel latencies as independent and therefore estimate total execution time by summing isolated per-kernel measurements. Our results indicate that this assumption does not always hold, as the GPU's dynamic frequency scaling introduces inter-kernel dependencies. We also outline several promising directions for leveraging this observation in future work, including improved latency-prediction models, GPU kernel-reordering strategies, and NAS-driven guidelines for frequency/latency/energy-aware model design.

8.4 DC Jun 2

E2LLM: Towards Efficient LLM Serving in Heterogeneous Edge/Fog Environments

Truong-Thanh Le, Amir Taherkordi, Hoang-Loc La et al.

Large Language Models (LLMs) have become integral to modern applications, yet their deployment remains challenging. Beyond executing the models themselves, practical deployment must address cost efficiency, low latency, and optimal resource utilization. Conventional approaches typically assume that an entire model can be hosted on a single device, which does not hold in many real-world scenarios, particularly in Edge and Fog environments where device resources are constrained. In this paper, we introduce E2LLM, a framework designed to enable efficient LLM deployment in such resource-limited settings. Rather than simply partitioning a single model across all available devices, E2LLM replicates the full model across multiple groups of devices (replicas) and applies model parallelism within each replica. Each replica is assigned a specialized role, PREFILL or DECODER, based on its efficiency in handling input and output tokens. This separation leverages the inherent differences between these two phases of LLM inference. To effectively organize devices, we utilize a Genetic Algorithm to form clusters that maximize system performance. Within each cluster, we apply Dynamic Programming to determine an optimal partitioning strategy that minimizes bottlenecks in model-parallel execution. Experimental results demonstrate that our approach adapts robustly to varying workloads, including scenarios with significant variation in input and output token lengths. Compared to the Splitwise baseline, E2LLM reduces average waiting time by over 50% under high-demand conditions.
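As a rough illustration of the kind of partitioning step the abstract mentions, the sketch below uses a textbook dynamic program to split a chain of layers into contiguous stages so that the slowest stage (the bottleneck) is as fast as possible. It is not E2LLM's actual algorithm, which the abstract does not specify. The function name `partition_layers`, the per-layer cost list, the per-device speed factors, and the assumption that devices are used in a fixed order are all hypothetical choices made for this example.

```python
from typing import List, Tuple


def partition_layers(layer_costs: List[float],
                     device_speeds: List[float]) -> Tuple[float, List[int]]:
    """Split layers into contiguous stages, one stage per device (in order).

    Hypothetical cost model: a stage's time is the sum of its layer costs
    divided by the speed factor of the device that runs it.
    Returns (bottleneck_time, stage_end_indices), where stage_end_indices[j]
    is the exclusive end index of stage j.
    """
    n, k = len(layer_costs), len(device_speeds)
    if not 1 <= k <= n:
        raise ValueError("need between 1 and len(layer_costs) devices")

    # prefix[i] = total cost of the first i layers
    prefix = [0.0]
    for cost in layer_costs:
        prefix.append(prefix[-1] + cost)

    inf = float("inf")
    # best[j][i]: smallest bottleneck when the first i layers use the first j devices
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0

    for j in range(1, k + 1):
        for i in range(j, n + 1):          # every stage gets at least one layer
            for m in range(j - 1, i):      # stage j covers layers m .. i-1
                stage_time = (prefix[i] - prefix[m]) / device_speeds[j - 1]
                candidate = max(best[j - 1][m], stage_time)
                if candidate < best[j][i]:
                    best[j][i] = candidate
                    cut[j][i] = m

    # Walk the recorded cut points backwards to recover the stage boundaries.
    ends: List[int] = []
    i = n
    for j in range(k, 0, -1):
        ends.append(i)
        i = cut[j][i]
    ends.reverse()
    return best[k][n], ends


if __name__ == "__main__":
    # Five layers with made-up costs, two devices; the second is twice as fast.
    print(partition_layers([4.0, 2.0, 3.0, 1.0, 5.0], [1.0, 2.0]))
```

The triple loop costs O(k·n²) time, which is typically acceptable for the layer counts of a single model. A real system would also need to account for memory limits and inter-device communication, which this sketch ignores.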

4.1 LG Apr 11, 2025

Kernel-Level Energy-Efficient Neural Architecture Search for Tabular Dataset

Hoang-Loc La, Phuong Hoai Ha

Many studies estimate energy consumption using proxy metrics like memory usage, FLOPs, and inference latency, with the assumption that reducing these metrics will also lower energy consumption in neural networks. This paper, however, takes a different approach by introducing an energy-efficient Neural Architecture Search (NAS) method that directly focuses on identifying architectures that minimize energy consumption while maintaining acceptable accuracy. Unlike previous methods that primarily target vision and language tasks, the approach proposed here specifically addresses tabular datasets. Remarkably, the optimal architecture suggested by this method can reduce energy consumption by up to 92% compared to architectures recommended by conventional NAS.