CV · May 14, 2020

PENNI: Pruned Kernel Sharing for Efficient CNN Inference

Shiyu Li, Edward Hanson, Hai Li, Yiran Chen

arXiv:2005.07133v2 · 10.123 citations · Has Code

Originality: Incremental advance

AI Analysis

This work addresses the challenge of efficient CNN inference for deployment on resource-constrained devices, representing an incremental improvement over previous low-rank approximation methods by better exploiting model redundancy.

The paper tackles the problem of deploying state-of-the-art CNNs on resource-constrained devices by proposing PENNI, a compression framework that prunes 97% of parameters and 92% of FLOPs on ResNet18 on CIFAR10 with no accuracy loss, while reducing run-time memory by 44% and inference latency by 53%.

Although state-of-the-art (SOTA) CNNs achieve outstanding performance on various tasks, their high computation demand and massive number of parameters make them difficult to deploy on resource-constrained devices. Previous works on CNN acceleration use low-rank approximation of the original convolution layers to reduce computation cost. However, these methods are very difficult to apply to sparse models, which limits execution speedup because redundancies within the CNN model are not fully exploited. We argue that kernel-granularity decomposition can be conducted under a low-rank assumption while exploiting the redundancy within the remaining compact coefficients. Based on this observation, we propose PENNI, a CNN model compression framework that achieves model compactness and hardware efficiency simultaneously by (1) implementing kernel sharing in convolution layers via a small number of basis kernels and (2) alternately adjusting bases and coefficients with sparse constraints. Experiments show that we can prune 97% of parameters and 92% of FLOPs on ResNet18 on CIFAR10 with no accuracy loss, and achieve a 44% reduction in run-time memory consumption and a 53% reduction in inference latency.
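The kernel-sharing idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the assumption of one set of `d` basis kernels per layer, and the example sizes (64 channels, 3x3 kernels, `d = 5`, 10% of coefficients kept) are all hypothetical choices made for illustration. It shows how a kernel might be rebuilt as a sparse linear combination of shared basis kernels, and how the parameter count compares with a dense layer.

```python
# Illustrative sketch only; names, shapes, and sizes are assumptions,
# not taken from the PENNI code base.

def reconstruct_kernel(coeffs, bases):
    """Rebuild one flattened k*k kernel as a linear combination of basis kernels."""
    size = len(bases[0])
    kernel = [0.0] * size
    for c, basis in zip(coeffs, bases):
        if c == 0.0:  # a pruned coefficient contributes nothing
            continue
        for i in range(size):
            kernel[i] += c * basis[i]
    return kernel


def dense_params(c_out, c_in, k):
    """Parameter count of a standard convolution layer (bias ignored)."""
    return c_out * c_in * k * k


def shared_params(c_out, c_in, k, d, coeff_density=1.0):
    """Parameter count with d shared basis kernels and sparse coefficients.

    coeff_density is the assumed fraction of coefficients left after pruning.
    """
    basis_count = d * k * k
    coeff_count = round(c_out * c_in * d * coeff_density)
    return basis_count + coeff_count


# Two hypothetical 2x2 basis kernels, flattened row by row.
bases = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]
kernel = reconstruct_kernel([0.5, 0.0], bases)

dense = dense_params(64, 64, 3)
shared = shared_params(64, 64, 3, d=5, coeff_density=0.1)
print(kernel, dense, shared)
```

Under these assumed sizes, most of the stored parameters are coefficients rather than basis kernels, which is why sparsity in the coefficients matters for the overall compression ratio.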
