Study of GPU Scalability Efficiency for Artificial Intelligence Training
The work addresses efficiency challenges in GPU-based AI training that matter to both researchers and industry, but it appears incremental because it builds on existing benchmarks such as MLPerf.
The study analyzed how efficiently GPUs scale when training large-scale deep learning models and found configurations that optimize the relationship between performance, GPU usage, and efficiency. The results indicate a break-even point at which training times can be reduced while maximizing efficiency.
Training large-scale deep learning models has become a key challenge for the scientific community and industry. Although using GPUs at massive scale can significantly shorten training times, it has a negative impact on efficiency. In this article, we present a detailed analysis of the training times reported by MLPerf Training v4.1 for four workloads: BERT, Llama2 LoRA, RetinaNet, and Stable Diffusion. The analysis shows that some configurations optimise the relationship between performance, GPU usage, and efficiency. The results point to a break-even point at which training times can be reduced while maximising efficiency.
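The trade-off described above can be made concrete with the usual definitions of speedup and scaling efficiency, computed relative to the smallest configuration. The following is a minimal sketch, not the authors' method: the function names, the efficiency threshold used to pick a "break-even" configuration, and the timing values are all hypothetical and are not taken from MLPerf Training v4.1.

```python
def scaling_metrics(runs):
    """Compute speedup and efficiency relative to the smallest configuration.

    runs: iterable of (num_gpus, minutes_to_train) pairs.
    Returns a list of dicts sorted by GPU count.
    """
    runs = sorted(runs)
    base_gpus, base_minutes = runs[0]
    metrics = []
    for gpus, minutes in runs:
        speedup = base_minutes / minutes            # how much faster than the baseline
        efficiency = speedup / (gpus / base_gpus)   # 1.0 would be perfect linear scaling
        metrics.append({
            "gpus": gpus,
            "minutes": minutes,
            "speedup": speedup,
            "efficiency": efficiency,
            "gpu_minutes": gpus * minutes,          # total GPU time consumed
        })
    return metrics


def largest_config_above(metrics, min_efficiency):
    """Assumed break-even rule: the largest GPU count whose efficiency
    stays at or above min_efficiency. This criterion is illustrative only."""
    candidates = [m for m in metrics if m["efficiency"] >= min_efficiency]
    return max(candidates, key=lambda m: m["gpus"])


if __name__ == "__main__":
    # Hypothetical timings (GPU count, minutes); NOT real benchmark results.
    example_runs = [(8, 100.0), (16, 52.0), (32, 29.0), (64, 20.0)]
    results = scaling_metrics(example_runs)
    for m in results:
        print(f'{m["gpus"]:>3} GPUs  speedup={m["speedup"]:.2f}  '
              f'efficiency={m["efficiency"]:.2f}  GPU-min={m["gpu_minutes"]:.0f}')
    best = largest_config_above(results, min_efficiency=0.8)
    print("Selected configuration:", best["gpus"], "GPUs")
```

With these made-up numbers, training time keeps falling as GPUs are added, but total GPU-minutes rise and efficiency drops, which is the kind of tension a break-even analysis is meant to resolve. The threshold of 0.8 is an arbitrary choice for illustration.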