Common Subexpression-based Compression and Multiplication of Sparse Constant Matrices
This work addresses the problem of deploying compressed deep learning models on low-cost embedded devices by improving scalability and performance, though it is incremental as it builds on existing CSE and compression techniques.
The paper tackles the inefficiency of existing common subexpression elimination (CSE) methods for compressing sparse constant matrices in deep learning inference, proposing a random search-based algorithm that extracts CSEs in a 1000x1000 matrix in a minute and a compression format achieving over 50% compression rates and 20% faster matrix multiplication on embedded systems.
In deep learning inference, model parameters are pruned and quantized to reduce the model size. Compression methods and common subexpression (CSE) elimination algorithms are applied on sparse constant matrices to deploy the models on low-cost embedded devices. However, the state-of-the-art CSE elimination methods do not scale well for handling large matrices. They reach hours for extracting CSEs in a $200 \times 200$ matrix while their matrix multiplication algorithms execute longer than the conventional matrix multiplication methods. Besides, there exist no compression methods for matrices utilizing CSEs. As a remedy to this problem, a random search-based algorithm is proposed in this paper to extract CSEs in the column pairs of a constant matrix. It produces an adder tree for a $1000 \times 1000$ matrix in a minute. To compress the adder tree, this paper presents a compression format by extending the Compressed Sparse Row (CSR) to include CSEs. While compression rates of more than $50\%$ can be achieved compared to the original CSR format, simulations for a single-core embedded system show that the matrix multiplication execution time can be reduced by $20\%$.