Bandwidth-efficient Inference for Neural Image Compression
This work addresses bandwidth and energy efficiency for neural network deployment on resource-constrained mobile and edge devices. It represents an incremental improvement in optimization techniques.
The paper tackles limited communication bandwidth and power constraints for neural network inference on mobile and edge devices by proposing an end-to-end differentiable, bandwidth-efficient inference method with activation compression. It reports up to 19x bandwidth reduction and 6.21x energy saving on the low-level task of image compression.
With neural networks growing deeper and feature maps growing larger, limited communication bandwidth with external memory (DRAM) and power constraints have become a bottleneck for network inference on mobile and edge devices. In this paper, we propose an end-to-end differentiable, bandwidth-efficient neural inference method in which activations are compressed by a neural data compression method. Specifically, we propose a transform-quantization-entropy coding pipeline for activation compression, with symmetric exponential Golomb coding and a data-dependent Gaussian entropy model for arithmetic coding. Combined with existing model quantization methods, the low-level task of image compression can achieve up to 19x bandwidth reduction with 6.21x energy saving.
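The quantization and entropy coding stages of the pipeline can be illustrated with a minimal Python sketch. The abstract does not define the details, so everything here is an assumption: "symmetric" exponential Golomb coding is interpreted as an order-0 exp-Golomb code applied after a zigzag mapping of signed values, the Gaussian model is used only to estimate a bit cost (a real arithmetic coder is not included), the transform stage is omitted, and all function names are hypothetical.

```python
import math


def quantize(x: float, step: float) -> int:
    """Uniform scalar quantization of one activation value (assumed scheme)."""
    return round(x / step)


def zigzag(v: int) -> int:
    """Map signed to non-negative integers: 0, -1, 1, -2, 2 -> 0, 1, 2, 3, 4."""
    return 2 * v if v >= 0 else -2 * v - 1


def unzigzag(u: int) -> int:
    """Inverse of zigzag."""
    return u // 2 if u % 2 == 0 else -(u + 1) // 2


def exp_golomb_encode(u: int) -> str:
    """Order-0 exponential Golomb code word for u >= 0, as a bit string."""
    binary = bin(u + 1)[2:]                 # binary form of u + 1
    return "0" * (len(binary) - 1) + binary  # zero prefix gives the length


def encode_signed(values) -> str:
    """Concatenate the code words of a sequence of signed integers."""
    return "".join(exp_golomb_encode(zigzag(v)) for v in values)


def decode_signed(bits: str) -> list:
    """Decode a bit string produced by encode_signed."""
    out, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":               # count the zero prefix
            zeros += 1
            i += 1
        u = int(bits[i:i + zeros + 1], 2) - 1
        i += zeros + 1
        out.append(unzigzag(u))
    return out


def gaussian_bits(q: int, mu: float, sigma: float) -> float:
    """Estimated bits for integer q under a Gaussian with unit-width bins."""
    def cdf(x: float) -> float:
        return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

    p = max(cdf(q + 0.5) - cdf(q - 0.5), 1e-9)  # clamp to avoid log(0)
    return -math.log2(p)
```

For example, `encode_signed([quantize(a, 0.5) for a in activations])` yields a bit string whose length can be compared with the sum of `gaussian_bits` over the same symbols. Values near zero receive the shortest code words, which suits activations that concentrate around zero after a transform.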