LG | MS | Apr 17, 2025

NNTile: a machine learning framework capable of training extremely large GPT language models on a single node

arXiv:2504.13236v1 | 1 citation | h-index: 8
Originality: Incremental advance
AI Analysis

The paper addresses the problem of efficiently training large language models on limited hardware, which matters to both researchers and practitioners. The advance is incremental because it builds on existing task-based parallelism methods.

The study introduces the NNTile framework, which uses task-based parallelism via StarPU to automatically schedule operations across CPUs and GPUs, enabling training of extremely large GPT models on a single node, as demonstrated in numerical experiments.

This study presents NNTile, a framework for training large deep neural networks on heterogeneous clusters. NNTile is built on the StarPU library, which implements task-based parallelism and schedules all submitted tasks onto all available processing units (CPUs and GPUs). This means that any particular operation needed to train a large neural network can run on any CPU core or GPU device, depending on automatic scheduling decisions. This approach shifts the burden of deciding where to compute and when to communicate from a human to an automatic decision maker, whether a simple greedy heuristic or complex AI-based software. The performance of the tool for training large language models is demonstrated in extensive numerical experiments.
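The idea of letting a "simple greedy heuristic" decide where each operation runs can be illustrated with a toy scheduler. The sketch below is not NNTile or StarPU code and does not use their APIs. The `Task` class, the `greedy_schedule` function, the worker names, and all cost numbers are hypothetical, and the sketch assumes independent tasks with known runtime estimates, ignoring data dependencies and transfer costs that a real runtime must handle.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    cost: dict  # device kind ("cpu" or "gpu") -> estimated runtime (made-up units)


def greedy_schedule(tasks, workers):
    """Assign each task, in submission order, to the worker that finishes it earliest.

    `workers` maps a worker name to its device kind, e.g. {"cpu0": "cpu"}.
    Assumption: tasks are independent and moving data between devices is free.
    """
    free_at = {name: 0.0 for name in workers}  # time at which each worker is idle
    plan = []
    for task in tasks:
        # Pick the worker with the earliest estimated completion time for this task.
        best = min(workers, key=lambda n: free_at[n] + task.cost[workers[n]])
        start = free_at[best]
        free_at[best] = start + task.cost[workers[best]]
        plan.append((task.name, best, start, free_at[best]))
    return plan, max(free_at.values())


# Hypothetical workload: matrix products are much faster on the GPU,
# while cheap elementwise operations are nearly as fast on the CPU.
workers = {"cpu0": "cpu", "gpu0": "gpu"}
tasks = [
    Task("matmul_a", {"cpu": 8.0, "gpu": 1.0}),
    Task("add", {"cpu": 1.0, "gpu": 0.5}),
    Task("matmul_b", {"cpu": 8.0, "gpu": 1.0}),
    Task("norm", {"cpu": 1.0, "gpu": 0.5}),
]
plan, makespan = greedy_schedule(tasks, workers)
for name, worker, start, end in plan:
    print(f"{name:9s} -> {worker} [{start:.1f}, {end:.1f}]")
print("makespan:", makespan)
```

In this toy run the heavy products land on the GPU while the cheap operations fill the otherwise idle CPU, which is the kind of placement decision the abstract describes handing over to an automatic scheduler.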

