Simulating Performance of ML Systems with Offline Profiling
This work addresses the challenge of optimizing ML system performance for developers and researchers, but it appears incremental as it builds on existing profiling and simulation techniques.
The authors tackled the problem of understanding and improving complex ML systems by proposing a simulation approach based on offline profiling, which uses operation-level profiling and dataflow-based simulation to offer a unified, automated, and accurate solution across frameworks and models.
We advocate that simulation based on offline profiling is a promising approach to better understand and improve complex ML systems. Our approach combines operation-level profiling with dataflow-based simulation, so that it offers a unified and automated solution across frameworks and ML models, and it remains accurate by accounting for the various parallelization strategies used in a real system.
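To make the idea concrete, the following is a minimal sketch of dataflow-based simulation driven by profiled operation times. It is an illustration under stated assumptions, not the authors' implementation: the function name `simulate`, the operation names, the resources "gpu" and "net", and all durations are hypothetical. Each operation is assumed to have a single offline-profiled duration, each resource runs one operation at a time, and operations are served in order of readiness.

```python
import heapq
from collections import defaultdict


def simulate(resource_of, deps, duration_ms):
    """Replay a dataflow graph using profiled per-operation durations.

    resource_of: op name -> resource it runs on (hypothetical names)
    deps:        op name -> list of ops that must finish first
    duration_ms: op name -> duration measured by offline profiling
    Returns op name -> (start_ms, finish_ms).
    """
    succs = defaultdict(list)
    pending = {}
    for op in resource_of:
        preds = deps.get(op, [])
        pending[op] = len(preds)
        for p in preds:
            succs[p].append(op)

    ready_at = {op: 0.0 for op in resource_of}
    # Min-heap of (ready time, op) for ops whose inputs are all available.
    heap = [(0.0, op) for op, n in pending.items() if n == 0]
    heapq.heapify(heap)
    free_at = defaultdict(float)  # time at which each resource becomes idle
    schedule = {}

    while heap:
        ready, op = heapq.heappop(heap)
        res = resource_of[op]
        # An op starts once its inputs are ready and its resource is idle.
        start = max(ready, free_at[res])
        finish = start + duration_ms[op]
        free_at[res] = finish
        schedule[op] = (start, finish)
        for s in succs[op]:
            ready_at[s] = max(ready_at[s], finish)
            pending[s] -= 1
            if pending[s] == 0:
                heapq.heappush(heap, (ready_at[s], s))
    return schedule


# Toy two-layer training step; all numbers are made up for illustration.
resource_of = {
    "fwd1": "gpu", "fwd2": "gpu", "bwd2": "gpu", "bwd1": "gpu",
    "allreduce2": "net", "allreduce1": "net",
}
deps = {
    "fwd2": ["fwd1"], "bwd2": ["fwd2"], "bwd1": ["bwd2"],
    "allreduce2": ["bwd2"], "allreduce1": ["bwd1"],
}
duration_ms = {
    "fwd1": 2.0, "fwd2": 3.0, "bwd2": 4.0, "bwd1": 3.0,
    "allreduce2": 5.0, "allreduce1": 2.0,
}

schedule = simulate(resource_of, deps, duration_ms)
makespan = max(finish for _, finish in schedule.values())
print(f"simulated step time: {makespan} ms")
```

In this toy graph the gradient all-reduce of the second layer overlaps with the backward pass of the first, so the simulated step time is 16 ms rather than the 19 ms obtained by summing all durations. Capturing this kind of overlap is what lets a dataflow simulation account for parallelization strategies.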