CV · Nov 1, 2018

Balanced Sparsity for Efficient DNN Inference on GPU

arXiv:1811.00206v4 · 99 citations
Originality: Incremental advance
AI Analysis

This work addresses the need for efficient deployment of deep learning services on commercial hardware. It offers a practical solution that balances speed and accuracy, though it is an incremental step relative to existing sparsity methods.

The paper tackles the problem of accelerating deep neural network inference on GPUs without sacrificing accuracy. It proposes balanced sparsity, which achieves up to 3.1x speedup while keeping model accuracy comparable to that of fine-grained sparsity.

In trained deep neural networks, unstructured pruning can remove redundant weights to lower storage cost. However, it requires customized hardware to speed up practical inference. Another trend accelerates sparse model inference on general-purpose hardware by adopting coarse-grained sparsity, which prunes or regularizes consecutive weights for efficient computation, but this method often sacrifices model accuracy. In this paper, we propose a novel fine-grained sparsity approach, balanced sparsity, to achieve high model accuracy efficiently on commercial hardware. Our approach adapts to the high-parallelism property of GPUs, showing great potential for sparsity in the wide deployment of deep learning services. Experimental results show that balanced sparsity achieves up to 3.1x practical speedup for model inference on GPU, while retaining the same high model accuracy as fine-grained sparsity.
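The abstract does not spell out how balanced sparsity is constructed, so the following is only a minimal sketch under a stated assumption: that "balanced" means each weight row is split into equal-sized blocks and every block keeps the same number of largest-magnitude weights. The function name `balanced_prune` and the parameters `block_size` and `keep` are hypothetical and are not taken from the paper.

```python
def balanced_prune(row, block_size, keep):
    """Zero all but the `keep` largest-magnitude weights in each block.

    Assumption (not confirmed by the abstract): a row is split into
    equal-sized blocks and every block retains the same number of
    non-zero weights, so parallel workers get equal amounts of work.
    """
    if block_size <= 0 or len(row) % block_size != 0:
        raise ValueError("row length must be a positive multiple of block_size")
    if not 0 <= keep <= block_size:
        raise ValueError("keep must be between 0 and block_size")

    pruned = []
    for start in range(0, len(row), block_size):
        block = row[start:start + block_size]
        # Positions of the `keep` largest-magnitude weights in this block.
        by_magnitude = sorted(range(block_size),
                              key=lambda i: abs(block[i]),
                              reverse=True)
        kept = set(by_magnitude[:keep])
        # Keep selected weights, replace the rest with zero.
        pruned.extend(w if i in kept else 0.0 for i, w in enumerate(block))
    return pruned


# Two blocks of four weights, two survivors per block (50% sparsity).
row = [0.8, -0.1, 0.2, 1.5, 0.4, -0.3, -0.9, 0.05]
print(balanced_prune(row, block_size=4, keep=2))
```

Under this assumption, every block ends up with an identical non-zero count, which is the kind of regularity that maps well onto GPU threads doing equal work, while the choice of which weights survive inside a block stays fine-grained.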

