Zeynep Kiziltan

AI
h-index: 21
7 papers
268 citations
Novelty: 49%
AI Score: 40

7 Papers

DC Jun 30, 2023
Online Job Failure Prediction in an HPC System

Francesco Antici, Andrea Borghesi, Zeynep Kiziltan

Modern High Performance Computing (HPC) systems are complex machines, with major impacts on economy and society. Along with their computational capability, their energy consumption is also steadily rising, representing a critical issue given the ongoing environmental and energy crisis. Therefore, developing strategies to optimize HPC system management is of paramount importance, both to guarantee top-tier performance and to improve energy efficiency. One strategy is to act at the workload level and highlight the jobs that are most likely to fail, prior to their execution on the system. Jobs that fail during execution needlessly occupy resources and can delay other jobs, adversely affecting system performance and energy consumption. In this paper, we study job failure prediction at submit-time using classical machine learning algorithms. Our novelty lies in (i) the combination of these algorithms with Natural Language Processing (NLP) tools to represent jobs and (ii) the design of the approach to work in an online fashion in a real system. The study is based on a dataset extracted from a production machine hosted at the HPC centre CINECA in Italy. Experimental results show that our approach is promising.

AI Feb 5
SweetSpot: An Analytical Model for Predicting Energy Efficiency of LLM Inference

Hiari Pizzini Cavagna, Andrea Proia, Giacomo Madella et al.

Large Language Model (LLM) inference is central to modern AI applications and dominates worldwide datacenter workloads, making it critical to predict its energy footprint. Existing approaches estimate energy consumption as a simple linear function of input and output sequence lengths. However, by analyzing the autoregressive structure of Transformers, which implies a fundamentally non-linear relationship between input and output sequence lengths and energy consumption, we demonstrate the existence of a generation energy minimum. Peak efficiency occurs with short-to-moderate inputs and medium-length outputs, while efficiency drops sharply for long inputs or very short outputs. Consequently, we propose SweetSpot, an analytical model derived from the computational and memory-access complexity of the Transformer architecture, which accurately characterizes the efficiency curve as a function of input and output lengths. To assess accuracy, we measure energy consumption using TensorRT-LLM on NVIDIA H100 GPUs across a diverse set of LLMs ranging from 1B to 9B parameters, including OPT, LLaMA, Gemma, Falcon, Qwen2, and Granite. We test input and output lengths from 64 to 4096 tokens and achieve a mean MAPE of 1.79%. Our results show that aligning sequence lengths with these efficiency "sweet spots" reduces energy usage by up to 33.41x, enabling informed truncation, summarization, and adaptive generation strategies in production systems.
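The idea of a per-token energy minimum can be illustrated with a toy cost function. The sketch below is not the SweetSpot model: the functional form and every coefficient are hypothetical, chosen only to show how a fixed per-request overhead (amortized over more output tokens) and a decode cost that grows with context length can combine into a non-linear curve with an interior minimum.

```python
# Toy illustration only: NOT the SweetSpot model from the paper.
# Every coefficient below is hypothetical, in arbitrary energy units.
FIXED = 50.0          # per-request overhead, independent of sequence lengths
PREFILL_LIN = 0.01    # prefill cost per input token
PREFILL_QUAD = 1e-6   # prefill attention cost, grows with n_in ** 2
DECODE_BASE = 0.05    # decode cost per generated token
DECODE_CTX = 1e-4     # decode cost per generated token per context token


def request_energy(n_in: int, n_out: int) -> float:
    """Total energy of one request under the toy assumptions above."""
    prefill = PREFILL_LIN * n_in + PREFILL_QUAD * n_in ** 2
    # Autoregressive decoding: token t attends to n_in + t context tokens.
    decode = sum(DECODE_BASE + DECODE_CTX * (n_in + t) for t in range(1, n_out + 1))
    return FIXED + prefill + decode


def energy_per_output_token(n_in: int, n_out: int) -> float:
    """Efficiency metric for the sketch: energy divided by generated tokens."""
    return request_energy(n_in, n_out) / n_out


# Sweep output lengths over the range quoted in the abstract (64 to 4096).
lengths = [2 ** k for k in range(6, 13)]
best_n_out = min(lengths, key=lambda n: energy_per_output_token(512, n))
```

With these made-up coefficients, the cheapest output length for a 512-token input falls strictly inside the sweep rather than at either end, which is the qualitative behaviour a purely linear model cannot reproduce.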

AI Sep 23, 2024
Automatic Feature Learning for Essence: a Case Study on Car Sequencing

Alessio Pellegrino, Özgür Akgün, Nguyen Dang et al.

Constraint modelling languages such as Essence offer a means to describe combinatorial problems at a high level, i.e., without committing to detailed modelling decisions for a particular solver or solving paradigm. Given a problem description written in Essence, there are multiple ways to translate it to a low-level constraint model. Choosing the right combination of a low-level constraint model and a target constraint solver can have a significant impact on the effectiveness of the solving process. Furthermore, the choice of the best combination of constraint model and solver can be instance-dependent, i.e., there may not exist a single combination that works best for all instances of the same problem. In this paper, we consider the task of building machine learning models to automatically select the best combination for a problem instance. A critical part of the learning process is to define instance features, which serve as input to the selection model. Our contribution is automatic learning of instance features directly from the high-level representation of a problem instance using a language model. We evaluate the performance of our approach using the Essence modelling language with a case study involving the car sequencing problem.

AI Feb 26, 2022
TabID: Automatic Identification and Tabulation of Subproblems in Constraint Models

Özgür Akgün, Ian P. Gent, Christopher Jefferson et al.

The performance of a constraint model can often be improved by converting a subproblem into a single table constraint (referred to as tabulation). Finding subproblems to tabulate is traditionally a manual and time-intensive process, even for expert modellers. This paper presents TabID, an entirely automated method to identify promising subproblems for tabulation in constraint programming. We introduce a diverse set of heuristics designed to identify promising candidates for tabulation, aiming to improve solver performance. These heuristics are intended to encapsulate various factors that contribute to useful tabulation. We also present additional checks to limit the potential drawbacks of suboptimal tabulation. We comprehensively evaluate our approach using benchmark problems from existing literature that previously relied on constraint programming experts manually identifying the constraints to tabulate. We demonstrate that our automated identification and tabulation process achieves comparable, and in some cases improved, results. We empirically evaluate the efficacy of our approach on a variety of solvers, including standard CP (Minion and Gecode), clause-learning CP (Chuffed and OR-Tools) and SAT solvers (Kissat). Our findings highlight the substantial potential of fully automated tabulation, suggesting its integration into automated model reformulation tools.
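Tabulation itself, as defined in the first sentence of this abstract, can be sketched in a few lines: enumerate every assignment that satisfies a small subproblem and keep the result as the allowed tuples of one table constraint. The subproblem, variable names, and `tabulate` helper below are hypothetical; the sketch does not reproduce TabID's heuristics for choosing which subproblems to tabulate.

```python
from itertools import product


def tabulate(domains, constraints):
    """Return all tuples over `domains` that satisfy every constraint.

    The returned list is the set of allowed tuples for a single table
    constraint that replaces the original group of constraints.
    """
    return [
        assignment
        for assignment in product(*domains)
        if all(check(*assignment) for check in constraints)
    ]


# Hypothetical subproblem over x, y, z in 1..4: x < y and x + y == z.
domains = [range(1, 5), range(1, 5), range(1, 5)]
subproblem = [
    lambda x, y, z: x < y,
    lambda x, y, z: x + y == z,
]
table = tabulate(domains, subproblem)
# table == [(1, 2, 3), (1, 3, 4)]
```

Brute-force enumeration grows exponentially with the number of variables, which is one reason the choice of which subproblems to tabulate matters.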

AI Sep 22, 2020
A Constraint Programming-based Job Dispatcher for Modern HPC Systems and Applications

Cristian Galleguillos, Zeynep Kiziltan, Ricardo Soto

Constraint Programming (CP) is a well-established area in AI as a programming paradigm for modelling and solving discrete optimization problems, and it has been successfully applied to tackle the on-line job dispatching problem in HPC systems, including those running modern applications. The limitations of the available CP-based job dispatchers may hinder their practical use in today's systems, which are becoming larger in size and more demanding in resource allocation. In an attempt to bring basic AI research closer to a deployed application, we present a new CP-based on-line job dispatcher for modern HPC systems and applications. Unlike its predecessors, our new dispatcher tackles the entire problem in CP and its model size is independent of the system size. Experimental results based on a simulation study show that with our approach dispatching performance increases significantly in a large system and in a system where allocation is nontrivial.

DC Jul 27, 2020
A Machine Learning Approach to Online Fault Classification in HPC Systems

Alessio Netti, Zeynep Kiziltan, Ozalp Babaoglu et al.

As High-Performance Computing (HPC) systems strive towards the exascale goal, failure rates at both the hardware and software levels will increase significantly. Thus, detecting and classifying faults in HPC systems as they occur and initiating corrective actions before they can transform into failures becomes essential for continued operation. Central to this objective is fault injection, which is the deliberate triggering of faults in a system so as to observe their behavior in a controlled environment. In this paper, we propose a fault classification method for HPC systems based on machine learning. The novelty of our approach rests with the fact that it can be operated on streamed data in an online manner, thus opening the possibility to devise and enact control actions on the target system in real-time. We introduce a high-level, easy-to-use fault injection tool called FINJ, with a focus on the management of complex experiments. In order to train and evaluate our machine learning classifiers, we inject faults into an in-house experimental HPC system using FINJ, and generate a fault dataset which we describe extensively. Both FINJ and the dataset are publicly available to facilitate resiliency research in the HPC systems field. Experimental results demonstrate that our approach allows almost perfect classification accuracy to be reached for different fault types with low computational overhead and minimal delay.

AI Oct 3, 2019
A Commentary on "Breaking Row and Column Symmetries in Matrix Models"

Alan M. Frisch, Brahim Hnich, Zeynep Kiziltan et al.

The CP 2002 paper entitled "Breaking Row and Column Symmetries in Matrix Models" by Flener et al. (https://link.springer.com/chapter/10.1007%2F3-540-46135-3_31) describes some of the first work for identifying and analyzing row and column symmetry in matrix models and for efficiently and effectively dealing with such symmetry using static symmetry-breaking ordering constraints. This commentary provides a retrospective on that work and highlights some of the subsequent work on the topic.
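One common form of static ordering constraint for row and column symmetry requires both the rows and the columns of the matrix to be in lexicographic order. The sketch below assumes that form and simply counts how many 0/1 matrices of a given size satisfy it; the helper names are hypothetical and no constraint solver is involved.

```python
from itertools import product


def is_double_lex(matrix):
    """True if rows and columns are each in non-decreasing lexicographic order."""
    def ordered(vectors):
        return all(a <= b for a, b in zip(vectors, vectors[1:]))

    rows = [tuple(row) for row in matrix]
    cols = list(zip(*matrix))
    return ordered(rows) and ordered(cols)


def all_matrices(n_rows, n_cols, values=(0, 1)):
    """Yield every n_rows x n_cols matrix over `values` as a list of row tuples."""
    for cells in product(values, repeat=n_rows * n_cols):
        yield [cells[i * n_cols:(i + 1) * n_cols] for i in range(n_rows)]


total = sum(1 for _ in all_matrices(2, 2))
kept = [m for m in all_matrices(2, 2) if is_double_lex(m)]
# total == 16, len(kept) == 7
```

For 2x2 binary matrices, 7 of the 16 assignments satisfy the ordering, so a solver posting these constraints would never visit the other 9 symmetric variants. On larger matrices, lexicographic ordering of rows and columns is not guaranteed to remove every symmetric assignment.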