Deep Learning Approximation: Zero-Shot Neural Network Speedup
This work addresses the need for faster neural network deployment in production systems and offers a practical route to efficiency, though the contribution is incremental because it builds on existing optimization methods.
The paper tackles the problem of high computational cost in neural network inference by proposing Deep Learning Approximation, a technique that speeds up trained networks without retraining, achieving a 2x speedup on YOLO with a 5% mAP drop that can be recovered by finetuning.
Neural networks offer high-accuracy solutions to a range of problems, but are costly to run in production systems because of the computational and memory requirements of a forward pass. Given a trained network, we propose a technique called Deep Learning Approximation to build a faster network in a tiny fraction of the time required for training, by manipulating only the network structure and coefficients, without re-training or access to the training data. Speedup is achieved by applying a sequence of independent optimizations that reduce the floating-point operations (FLOPs) required to perform a forward pass. First, lossless optimizations are applied, followed by lossy approximations using singular value decomposition (SVD) and low-rank matrix decomposition. The optimal approximation is chosen by weighing the relative accuracy loss against the FLOP reduction according to a single parameter specified by the user. On PASCAL VOC 2007 with the YOLO network, we show an end-to-end 2x speedup in a network forward pass with a 5% drop in mAP that can be regained by finetuning.
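To make the lossy step concrete, the following is a minimal sketch of a truncated-SVD factorization of a single dense weight matrix, together with the FLOP accounting that motivates it. It is not the paper's implementation. The function names, the `tradeoff` parameter, and the scoring rule in `choose_rank` are hypothetical, and the relative reconstruction error of the weights is used here only as a stand-in for accuracy loss, which the abstract does not define in detail.

```python
import numpy as np


def low_rank_factors(W, rank):
    """Return (A, B) such that A @ B is the best rank-`rank` approximation of W
    in the Frobenius norm (truncated SVD)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]  # shape (m, rank); singular values folded into A
    B = Vt[:rank, :]            # shape (rank, n)
    return A, B


def dense_flops(m, n):
    # Multiply-adds for y = W @ x with W of shape (m, n).
    return m * n


def factored_flops(m, n, rank):
    # y = A @ (B @ x): rank * n for B @ x, then m * rank for A @ (B @ x).
    return rank * (m + n)


def choose_rank(W, tradeoff):
    """Hypothetical selection rule (an assumption, not the paper's criterion):
    pick the rank minimizing relative_error + tradeoff * flop_ratio."""
    m, n = W.shape
    s = np.linalg.svd(W, compute_uv=False)
    total_energy = np.sum(s ** 2)
    best_rank, best_score = None, np.inf
    for k in range(1, len(s) + 1):
        # Relative Frobenius error of the rank-k truncation.
        rel_err = np.sqrt(np.sum(s[k:] ** 2) / total_energy)
        flop_ratio = factored_flops(m, n, k) / dense_flops(m, n)
        score = rel_err + tradeoff * flop_ratio
        if score < best_score:
            best_rank, best_score = k, score
    return best_rank


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic weight matrix with exact rank 4, standing in for a trained layer.
    W = rng.standard_normal((64, 4)) @ rng.standard_normal((4, 48))
    k = choose_rank(W, tradeoff=1.0)
    A, B = low_rank_factors(W, k)
    print("chosen rank:", k)
    print("dense FLOPs:", dense_flops(*W.shape))
    print("factored FLOPs:", factored_flops(*W.shape, k))
    print("max abs error:", np.max(np.abs(A @ B - W)))
```

The factorization only saves work when `rank * (m + n) < m * n`, so a larger `tradeoff` pushes the choice toward smaller ranks at the cost of a larger approximation error. No training data is touched, since the sketch operates on the weights alone.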