LG AI Feb 2

Performance of Small Language Model Pretraining on FABRIC: An Empirical Study

arXiv:2602.02632v1 Has Code

Originality: Synthesis-oriented

AI Analysis

This work addresses the problem of resource-efficient pretraining for academic users with limited hardware, offering incremental improvements in distributed training techniques.

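The study examines how to pretrain small language models efficiently on limited datasets using commodity GPUs. It finds that Alpa's execution plans, which jointly optimize intra-operator and inter-operator/pipeline parallelism, performed best when GPUs were geographically distributed, particularly at network latencies in the tens of milliseconds. From these results the authors propose an approach for selecting a pretraining technique that lowers execution time and reduces the number of GPUs used.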

Large language models (LLMs) require enormous computing power to pretrain on massive datasets. When only limited datasets are available, smaller LLMs are a better choice to pretrain (on user-specified datasets), following the scaling laws of LLMs. Using pretrained models, vector embeddings can be generated for raw data and stored in vector databases to support modern AI applications and semantic search. In this work, we investigate the performance of pretraining techniques for smaller LLMs on an experimental testbed (with commodity GPUs) available to academic users at no charge. We consider data parallelism, intra-operator parallelism, and inter-operator/pipeline parallelism, as well as their combinations, for pretraining. We set up different GPU clusters with homogeneous and heterogeneous GPU hardware. Furthermore, we investigate the impact of network latency on pretraining performance, especially when GPUs are geographically distributed. We used GPT-2 medium and large models and pretrained them using the open-source packages Alpa and Ray. We observed that Alpa's execution plans that jointly optimized intra-operator and inter-operator/pipeline parallelism consistently performed best when GPUs were geographically distributed. This was especially true when network latencies were in the tens of milliseconds. Based on the insights gained from the experiments, we propose a systematic approach for selecting the appropriate pretraining technique to achieve high training performance (lower execution time) and to reduce the number of GPUs used.
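To make the latency sensitivity mentioned in the abstract more concrete, the following is a minimal back-of-the-envelope sketch of how per-step time might scale with network latency under data parallelism versus pipeline parallelism. It is not the paper's methodology and not Alpa's internal cost model: it assumes a simple latency-plus-bandwidth link, a ring all-reduce for gradient synchronization, a GPipe-style schedule with no overlap of compute and communication, and perfectly even work splitting. All names (`Link`, `ring_allreduce_time`, `data_parallel_step`, `pipeline_step`) and all numbers in the demo are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Link:
    """Hypothetical network link between GPUs or sites."""
    latency_s: float       # per-message latency in seconds
    bandwidth_Bps: float   # sustained bandwidth in bytes per second


def ring_allreduce_time(nbytes: float, n_workers: int, link: Link) -> float:
    """Textbook ring all-reduce estimate: 2(n-1) messages of nbytes/n each."""
    if n_workers < 2:
        return 0.0
    steps = 2 * (n_workers - 1)
    return steps * (link.latency_s + (nbytes / n_workers) / link.bandwidth_Bps)


def data_parallel_step(compute_s: float, grad_bytes: float,
                       n_workers: int, link: Link) -> float:
    """Each worker processes 1/n of the batch, then gradients are all-reduced."""
    return compute_s / n_workers + ring_allreduce_time(grad_bytes, n_workers, link)


def pipeline_step(compute_s: float, act_bytes: float, n_stages: int,
                  n_microbatches: int, link: Link) -> float:
    """GPipe-style estimate: (m + s - 1) slots, with no compute/comm overlap.

    Each slot covers one microbatch on one stage (forward + backward) plus
    two boundary transfers (activations forward, gradients backward).
    """
    if n_stages < 2:
        return compute_s
    slot_compute = compute_s / (n_stages * n_microbatches)
    transfer = link.latency_s + act_bytes / link.bandwidth_Bps
    return (n_microbatches + n_stages - 1) * (slot_compute + 2 * transfer)


if __name__ == "__main__":
    # Illustrative values only; none are taken from the paper.
    grad_bytes = 4 * 355e6   # assumes ~355M parameters stored in fp32
    act_bytes = 8e6          # assumed activation size per microbatch
    for latency_ms in (0.1, 1, 10, 50):
        link = Link(latency_s=latency_ms / 1e3, bandwidth_Bps=1.25e9)
        dp = data_parallel_step(2.0, grad_bytes, 4, link)
        pp = pipeline_step(2.0, act_bytes, 4, 16, link)
        print(f"{latency_ms:>5} ms  data-parallel={dp:.3f}s  pipeline={pp:.3f}s")
```

In this toy model, latency is paid once per message, so a strategy's sensitivity to distance depends on how many messages it sends per step and how large they are. A real planner such as Alpa searches over combinations of intra-operator and inter-operator parallelism using measured costs, so the sketch should be read only as intuition for why the best choice can change with latency.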
