DC LG May 20

Instant GPU Efficiency Visibility at Fleet Scale

arXiv:2605.20799 28.5
AI Analysis

For operators of large-scale GPU fleets, OFU provides a practical, deployment-ready metric for continuous efficiency monitoring without application instrumentation.

The authors propose Overall FLOP Utilization (OFU), a hardware-level GPU efficiency metric derived from two on-chip performance counters. After tile-quantization correction, OFU predicts application-level MFU to within 2 percentage points. It achieves r = 0.78 correlation with MFU across 608 production training jobs, and in fleet-wide monitoring it detected a 2.5x efficiency regression.

We present Overall FLOP Utilization (OFU), a hardware-level, precision-agnostic GPU efficiency metric for AI workloads on HPC systems, derived from two on-chip performance counters: Tensor Pipe Activity and SM clock frequency. OFU requires no application instrumentation and works across GPU generations and numeric precisions. We characterize five properties of the OFU approximation -- tile quantization, floating-point precision scaling, clock sampling noise, Tensor Core clock domains, and non-tensor undercounting -- through controlled GEMM experiments on H100 and GB200 across FP16, TF32, FP8, and NVFP4. After tile-quantization correction, OFU predicts application-level MFU to within <=2 percentage points. Against 608 production training jobs, OFU achieves r = 0.78 correlation with application-level MFU and surfaces two framework-level FLOPs miscalculations. Deployed across large-scale GPU fleets, OFU has detected a 2.5x efficiency regression and tracked precision-dependent utilization changes in mixed-precision pretraining. Our evaluation and operational experience suggest OFU is a practical, deployment-ready complement to application-level MFU for continuous fleet-wide efficiency monitoring.
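
The abstract names the two counters and the tile-quantization correction but does not give the formula. The sketch below is one plausible reading, not the authors' definition. It assumes OFU is the tensor-pipe active fraction scaled by the ratio of sampled SM clock to peak clock, and that the correction multiplies by the useful fraction of a GEMM output padded up to whole tiles. All function names, parameters, and tile sizes are hypothetical.

```python
import math


def ofu_estimate(tensor_pipe_active: float, sm_clock_mhz: float,
                 peak_clock_mhz: float) -> float:
    """Assumed OFU approximation (not taken from the paper).

    tensor_pipe_active: fraction of cycles the tensor pipe was busy, in [0, 1].
    sm_clock_mhz:       sampled SM clock frequency.
    peak_clock_mhz:     clock at which the vendor's peak FLOP rate is specified.
    """
    if not 0.0 <= tensor_pipe_active <= 1.0:
        raise ValueError("tensor_pipe_active must be a fraction in [0, 1]")
    if peak_clock_mhz <= 0:
        raise ValueError("peak_clock_mhz must be positive")
    return tensor_pipe_active * (sm_clock_mhz / peak_clock_mhz)


def tile_quantization_efficiency(m: int, n: int,
                                 tile_m: int, tile_n: int) -> float:
    """Useful fraction of tensor work when an M x N output is padded to tiles.

    The hardware processes whole tiles, so an output that does not divide
    evenly keeps the tensor pipe busy on padding that contributes no useful
    FLOPs. The tile shape is an assumption supplied by the caller.
    """
    padded_m = math.ceil(m / tile_m) * tile_m
    padded_n = math.ceil(n / tile_n) * tile_n
    return (m * n) / (padded_m * padded_n)


def corrected_ofu(ofu: float, tile_efficiency: float) -> float:
    """Scale raw OFU by the useful-work fraction (assumed form of the correction)."""
    return ofu * tile_efficiency
```

For example, under these assumptions a job with 50% tensor-pipe activity at 1500 MHz against a 2000 MHz peak clock gives a raw OFU of 0.375. A GEMM with M = 129 and N = 128 on hypothetical 128 x 128 tiles has a useful fraction of 129/256, about 0.504, so the corrected value would be roughly 0.189.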

