Efficient Identification of Butterfly Sparse Matrix Factorizations
This provides a theoretical foundation for uniquely identifying butterfly matrix factorizations, which is incremental but addresses a specific bottleneck in fast transforms and neural network compression.
The paper tackles the problem of uniquely factorizing matrices with butterfly structure into sparse factors, proving that any N×N butterfly matrix admits an essentially unique factorization into J factors (where N=2^J) and can be recovered via a hierarchical method. This enables efficient computation with O(N²) cost for factorization and O(N log N) for matrix-vector multiplication, applicable to transforms like Hadamard or Fourier and potentially compressing deep neural networks.
Fast transforms correspond to factorizations of the form $\mathbf{Z} = \mathbf{X}^{(1)} \ldots \mathbf{X}^{(J)}$, where each factor $ \mathbf{X}^{(\ell)}$ is sparse and possibly structured. This paper investigates essential uniqueness of such factorizations, i.e., uniqueness up to unavoidable scaling ambiguities. Our main contribution is to prove that any $N \times N$ matrix having the so-called butterfly structure admits an essentially unique factorization into $J$ butterfly factors (where $N = 2^{J}$), and that the factors can be recovered by a hierarchical factorization method, which consists in recursively factorizing the considered matrix into two factors. This hierarchical identifiability property relies on a simple identifiability condition in the two-layer and fixed-support setting. This approach contrasts with existing ones that fit the product of butterfly factors to a given matrix via gradient descent. The proposed method can be applied in particular to retrieve the factorization of the Hadamard or the discrete Fourier transform matrices of size $N=2^J$. Computing such factorizations costs $\mathcal{O}(N^{2})$, which is of the order of dense matrix-vector multiplication, while the obtained factorizations enable fast $\mathcal{O}(N \log N)$ matrix-vector multiplications and have the potential to be applied to compress deep neural networks.