CV · DC · NE · Nov 20, 2016

Deep Tensor Convolution on Multicores

arXiv:1611.06565v3 · 39 citations
Originality: Highly original
AI Analysis

This work addresses the performance bottleneck for video and volumetric image analysis on CPU hardware, offering a significant speedup over existing methods.

The paper tackles the slow CPU implementations of deep convolutional neural networks with 3D kernels. It extends Winograd-class algorithms to N-dimensional convolutions and optimizes them for CPU hardware, achieving a 5 to 25-fold improvement in throughput over the previous state of the art.

Deep convolutional neural networks (ConvNets) of 3-dimensional kernels allow joint modeling of spatiotemporal features. These networks have improved performance of video and volumetric image analysis, but have been limited in size due to the low memory ceiling of GPU hardware. Existing CPU implementations overcome this constraint but are impractically slow. Here we extend and optimize the faster Winograd-class of convolutional algorithms to the $N$-dimensional case and specifically for CPU hardware. First, we remove the need to manually hand-craft algorithms by exploiting the relaxed constraints and cheap sparse access of CPU memory. Second, we maximize CPU utilization and multicore scalability by transforming data matrices to be cache-aware, integer multiples of AVX vector widths. Treating 2-dimensional ConvNets as a special (and the least beneficial) case of our approach, we demonstrate a 5 to 25-fold improvement in throughput compared to previous state-of-the-art.
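To make the idea of an $N$-dimensional Winograd-class convolution concrete, here is a minimal NumPy sketch of a single output tile. It assumes the standard 1-D F(2,3) transform matrices (two outputs from a 3-tap filter) and extends them to $N$ dimensions by applying each matrix along every axis of the tile. The abstract does not say which tile sizes the authors use, and the names `apply_along_all_axes`, `winograd_tile`, and `direct_tile` are hypothetical, so this illustrates the general technique rather than the paper's implementation.

```python
import numpy as np

# Standard F(2,3) Winograd matrices: 2 outputs per 3-tap filter and 4-sample input tile.
BT = np.array([[1, 0, -1, 0],
               [0, 1, 1, 0],
               [0, -1, 1, 0],
               [0, 1, 0, -1]], dtype=float)   # input transform
G = np.array([[1.0, 0.0, 0.0],
              [0.5, 0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0, 0.0, 1.0]])               # kernel transform
AT = np.array([[1, 1, 1, 0],
               [0, 1, -1, -1]], dtype=float)  # output (inverse) transform


def apply_along_all_axes(M, x):
    """Multiply x by the matrix M along each of its axes in turn."""
    for axis in range(x.ndim):
        # Contract M's columns with the chosen axis, then move the new axis back in place.
        x = np.moveaxis(np.tensordot(M, x, axes=([1], [axis])), 0, axis)
    return x


def winograd_tile(d, g):
    """Winograd F(2,3) in N dimensions: d has shape (4,)*N, g has shape (3,)*N."""
    U = apply_along_all_axes(G, g)    # transformed kernel, shape (4,)*N
    V = apply_along_all_axes(BT, d)   # transformed input tile, shape (4,)*N
    return apply_along_all_axes(AT, U * V)  # elementwise product, then inverse transform


def direct_tile(d, g):
    """Reference result: direct sliding-window correlation over the same tile."""
    out = np.zeros((2,) * d.ndim)
    for idx in np.ndindex(*out.shape):
        window = d[tuple(slice(i, i + 3) for i in idx)]
        out[idx] = np.sum(window * g)
    return out


rng = np.random.default_rng(0)
d3 = rng.standard_normal((4, 4, 4))   # one 3-D input tile
g3 = rng.standard_normal((3, 3, 3))   # one 3-D kernel
match_3d = np.allclose(winograd_tile(d3, g3), direct_tile(d3, g3))
```

For this 3-D tile the elementwise stage needs 4³ = 64 multiplications, versus 2³ · 3³ = 216 for the direct method. That per-tile saving grows with dimension, which is consistent with the abstract calling the 2-dimensional case the least beneficial.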
