Analytical Characterization and Design Space Exploration for Optimization of CNNs
This work addresses the data-movement bottleneck in CNNs, a problem that matters to machine learning practitioners, though the contribution is incremental in that it builds on existing loop-level optimization techniques.
The paper tackles the problem of optimizing convolutional neural networks (CNNs) on multi-core CPUs by developing an analytical modeling approach to find the best loop-level optimization configurations, achieving comparable or better performance than state-of-the-art libraries and auto-tuning optimizers.
Moving data through the memory hierarchy is a fundamental bottleneck that can limit the performance of core algorithms of machine learning, such as convolutional neural networks (CNNs). Loop-level optimizations, including loop tiling and loop permutation, are fundamental transformations for reducing data movement. However, the search space for finding the best loop-level optimization configuration is explosively large. This paper develops an analytical modeling approach for finding the best loop-level optimization configuration for CNNs on multi-core CPUs. Experimental evaluation shows that this approach achieves comparable or better performance than state-of-the-art libraries and auto-tuning-based optimizers for CNNs.
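To make the idea of loop tiling and of model-driven configuration search concrete, the following is a minimal sketch and not the paper's actual model. It assumes a stride-1, unpadded convolution with output size P x Q, tiles only the output-channel (k) and output-column (q) loops, and uses a deliberately crude cost estimate in which every tile loads its whole footprint once. All function names (conv2d_naive, conv2d_tiled, tile_footprint, estimated_traffic, best_tiles) and the cache_elems parameter are hypothetical names introduced for illustration.

```python
from itertools import product
from math import ceil

def conv2d_naive(inp, wgt, C, K, P, Q, R, S):
    """out[k][p][q] = sum over c, r, s of inp[c][p+r][q+s] * wgt[k][c][r][s]."""
    out = [[[0] * Q for _ in range(P)] for _ in range(K)]
    for k, p, q, c, r, s in product(range(K), range(P), range(Q),
                                    range(C), range(R), range(S)):
        out[k][p][q] += inp[c][p + r][q + s] * wgt[k][c][r][s]
    return out

def conv2d_tiled(inp, wgt, C, K, P, Q, R, S, Tk, Tq):
    """Same computation with the k and q loops tiled by Tk and Tq."""
    out = [[[0] * Q for _ in range(P)] for _ in range(K)]
    for k0 in range(0, K, Tk):          # tile loop over output channels
        for q0 in range(0, Q, Tq):      # tile loop over output columns
            for k in range(k0, min(k0 + Tk, K)):
                for p in range(P):
                    for q in range(q0, min(q0 + Tq, Q)):
                        for c, r, s in product(range(C), range(R), range(S)):
                            out[k][p][q] += inp[c][p + r][q + s] * wgt[k][c][r][s]
    return out

def tile_footprint(C, P, R, S, Tk, Tq):
    """Distinct elements one (Tk, Tq) tile touches: input + weights + output."""
    in_elems = C * (P + R - 1) * (Tq + S - 1)
    w_elems = Tk * C * R * S
    out_elems = Tk * P * Tq
    return in_elems + w_elems + out_elems

def estimated_traffic(C, K, P, Q, R, S, Tk, Tq):
    """Toy assumption: each tile loads its full footprint exactly once."""
    n_tiles = ceil(K / Tk) * ceil(Q / Tq)
    return n_tiles * tile_footprint(C, P, R, S, Tk, Tq)

def best_tiles(C, K, P, Q, R, S, cache_elems):
    """Pick the tile sizes that fit in cache and minimize the toy estimate.

    Returns None when no tile fits in cache_elems elements.
    """
    fitting = [(Tk, Tq)
               for Tk in range(1, K + 1)
               for Tq in range(1, Q + 1)
               if tile_footprint(C, P, R, S, Tk, Tq) <= cache_elems]
    return min(fitting,
               key=lambda t: estimated_traffic(C, K, P, Q, R, S, *t),
               default=None)
```

The point of the sketch is the division of labor: the tiled loop nest changes only the order of iterations, not the result, while the cost functions rank configurations without running them. Here the candidate space is small enough to enumerate; in the setting the paper describes, the space also includes loop permutations and multiple tiling levels, which is what makes an analytical approach attractive compared with measuring each configuration.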