Beyond Desktop Computation: Challenges in Scaling a GPU Infrastructure
This is an incremental technical report for enterprises and labs facing infrastructure scaling issues.
The report addresses the challenge of scaling GPU infrastructure for data science applications, detailing the decision process and implementation of an on-premises, scalable GPU cluster system.
Enterprises and labs running computationally expensive data science applications sooner or later face the problem of scaled but unconnected infrastructure. To scale up, they can either hire an IT service provider or have in-house personnel attempt to implement a software stack. The first option can be quite expensive if the task is merely to connect several machines. For the second option, the data science staff often lack the experience needed to navigate the software jungle. In this technical report, we illustrate the decision process towards an on-premises infrastructure, our implemented system architecture, and the transformation of the software stack towards a scalable GPU cluster system.