A Systematic Approach to Blocking Convolutional Neural Networks
This work addresses performance bottlenecks in CNN implementations for computer vision applications, improving the efficiency of both custom hardware and software implementations.
The paper tackles the problem of optimizing convolutional neural networks (CNNs) for memory locality by developing an analytical model that automatically derives blockings, yielding up to an order of magnitude improvement in energy efficiency for custom hardware and, on x86 CPUs, up to 90% fewer memory accesses than implementations built on hand-optimized BLAS libraries.
Convolutional Neural Networks (CNNs) are the state-of-the-art solution for many computer vision problems, and many researchers have explored optimized implementations. Most implementations heuristically block the computation to deal with the large data sizes and high data reuse of CNNs. This paper explores how to block CNN computations for memory locality by creating an analytical model for CNN-like loop nests. Using this model, we automatically derive optimized blockings for common networks that improve the energy efficiency of custom hardware implementations by up to an order of magnitude. Compared to traditional CNN CPU implementations based on highly tuned, hand-optimized BLAS libraries, our x86 programs implementing the optimal blocking reduce the number of memory accesses by up to 90%.
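To make "blocking a CNN-like loop nest" concrete, the following is a minimal sketch in Python, not the paper's implementation. It writes a direct convolution as a six-deep loop nest and then tiles the output-channel and spatial loops so that each small output block is finished before the next one is started. The function names and the block sizes `bk`, `by`, `bx` are hypothetical and chosen for illustration; they are not blockings derived by the analytical model, and stride 1 with no padding is assumed.

```python
import numpy as np

def conv_naive(inp, w):
    """Direct convolution as a plain loop nest (assumes stride 1, no padding)."""
    C, H, W = inp.shape            # input channels, height, width
    K, _, Fh, Fw = w.shape         # output channels, kernel height, kernel width
    Oh, Ow = H - Fh + 1, W - Fw + 1
    out = np.zeros((K, Oh, Ow))
    for k in range(K):
        for y in range(Oh):
            for x in range(Ow):
                for c in range(C):
                    for fy in range(Fh):
                        for fx in range(Fw):
                            out[k, y, x] += inp[c, y + fy, x + fx] * w[k, c, fy, fx]
    return out

def conv_blocked(inp, w, bk=2, by=4, bx=4):
    """Same computation with the k, y, x loops tiled.

    bk, by, bx are illustrative block sizes (an assumption of this sketch),
    not values produced by the paper's model.
    """
    C, H, W = inp.shape
    K, _, Fh, Fw = w.shape
    Oh, Ow = H - Fh + 1, W - Fw + 1
    out = np.zeros((K, Oh, Ow))
    # Outer loops step from block to block.
    for k0 in range(0, K, bk):
        for y0 in range(0, Oh, by):
            for x0 in range(0, Ow, bx):
                # Inner loops stay inside one block, so the input window and
                # the bk filters it needs are reused while they are still hot.
                for k in range(k0, min(k0 + bk, K)):
                    for y in range(y0, min(y0 + by, Oh)):
                        for x in range(x0, min(x0 + bx, Ow)):
                            for c in range(C):
                                for fy in range(Fh):
                                    for fx in range(Fw):
                                        out[k, y, x] += (
                                            inp[c, y + fy, x + fx] * w[k, c, fy, fx]
                                        )
    return out
```

Both functions compute the same result; only the order in which output elements are visited changes. Choosing the block sizes, and deciding which loops to tile at which level of the memory hierarchy, is the search that the analytical model is meant to automate.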