High Performance Im2win and Direct Convolutions using Three Tensor Layouts on SIMD Architectures
This work addresses performance bottlenecks in convolution operations that matter to deep learning practitioners. Its contribution is incremental: it builds on existing convolution methods (im2win and direct convolution) and adds new tensor layouts and optimizations.
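The paper addresses inefficient convolution operations in deep neural networks by proposing three tensor data layouts for im2win convolution (NHWC, CHWN, and CHWN8), together with general optimization techniques for im2win and direct convolutions on SIMD architectures. Im2win convolution with the NHWC layout achieves up to 355% speedup over the NCHW layout, and the optimized im2win and direct convolutions reach up to 95% and 94% of the machine's theoretical peak performance, respectively.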
Convolution is the core component within deep neural networks, and it is computationally intensive and time-consuming. Tensor data layouts significantly impact convolution operations in terms of memory access and computational efficiency. Yet there is still a lack of comprehensive performance characterization of data layouts on SIMD architectures with respect to convolution methods. This paper proposes three novel data layouts for im2win convolution (NHWC, CHWN, and CHWN8) and introduces a set of general optimization techniques for both direct and im2win convolutions. We compare the optimized im2win convolution with the direct convolution and PyTorch's im2col-based convolution across the aforementioned layouts on SIMD machines. The experiments demonstrate that the im2win convolution with the new NHWC layout achieves up to 355% speedup over the NCHW layout. Our optimizations also significantly improve the performance of both im2win and direct convolutions. Our optimized im2win and direct convolutions achieve up to 95% and 94% of the machine's theoretical peak performance, respectively.
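To make the layout names in the abstract concrete, the following minimal sketch shows how a logical element (n, c, h, w) of a 4-D tensor could map to a flat memory offset under each layout. It is an illustration only, not the paper's implementation. The function names are hypothetical, and the CHWN8 mapping is an assumption: it is taken here to mean that the batch dimension is split into blocks of 8 images stored innermost, which the abstract does not spell out.

```python
# Illustrative flat-offset formulas for four tensor layouts.
# N = batch, C = channels, H = height, W = width.
# All function names are hypothetical and chosen for this sketch.

def offset_nchw(n, c, h, w, N, C, H, W):
    # Width is the innermost (contiguous) dimension.
    return ((n * C + c) * H + h) * W + w

def offset_nhwc(n, c, h, w, N, C, H, W):
    # Channels are the innermost (contiguous) dimension.
    return ((n * H + h) * W + w) * C + c

def offset_chwn(n, c, h, w, N, C, H, W):
    # Batch is the innermost (contiguous) dimension.
    return ((c * H + h) * W + w) * N + n

def offset_chwn8(n, c, h, w, N, C, H, W, block=8):
    # ASSUMPTION: CHWN8 is read as [N/8][C][H][W][8], so groups of
    # 8 images are contiguous. The source does not define it.
    if N % block != 0:
        raise ValueError("batch size must be a multiple of the block size")
    outer, inner = divmod(n, block)
    return (((outer * C + c) * H + h) * W + w) * block + inner

def is_bijection(offset_fn, N, C, H, W):
    # Every logical index must land on a distinct offset in [0, N*C*H*W).
    seen = {
        offset_fn(n, c, h, w, N, C, H, W)
        for n in range(N)
        for c in range(C)
        for h in range(H)
        for w in range(W)
    }
    return seen == set(range(N * C * H * W))

if __name__ == "__main__":
    dims = (16, 3, 4, 5)  # small example sizes, chosen arbitrarily
    for fn in (offset_nchw, offset_nhwc, offset_chwn, offset_chwn8):
        print(fn.__name__, is_bijection(fn, *dims))
```

The innermost dimension determines which elements sit next to each other in memory and can therefore be loaded with a single contiguous SIMD load: neighboring pixels along the width in NCHW, the channels of one pixel in NHWC, and the same pixel across images in CHWN and (under the assumption above) CHWN8.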