A Practical Mode-parallel Implementation of the (H-)Tucker Decomposition via Randomization
This work addresses efficiency bottlenecks in tensor decomposition for data analysis, offering incremental improvements in parallelization for researchers and practitioners in computational mathematics and machine learning.
The paper tackles the high computational and memory demands of Tucker and H-Tucker tensor decompositions for high-dimensional data by proposing a mode-parallel implementation using randomization techniques, resulting in reduced running time and storage requirements with good scaling in HPC environments.
Over the last decades, tensors have emerged as the right tool to represent multidimensional data in a compact yet informative manner. Moreover, it is well known that, by performing low-rank factorizations of such tensors, one is often able to unveil hidden structure in the data, which mainly stems from unexpected dependencies among the different variables encoded in the given tensor. However, computing these factorizations is extremely energy-consuming and memory-demanding, especially for high-dimensional tensors, namely those with a large number of modes. In this paper we focus on two state-of-the-art tensor decompositions: the Tucker and H-Tucker decompositions. We propose novel numerical strategies able to perform these factorizations in a \emph{mode-parallel} fashion, that is, the operations required by the algorithm along all modes are performed in parallel. This is in contrast to many procedures available in the literature, which parallelize some of the operations along each mode, e.g., tensor-times-matrix steps, while still visiting one mode at a time in a sequential manner. Our strategies make use of cutting-edge randomization techniques comprising fiber sampling and randomized range-finding steps. We provide upper bounds on the expected value of the error incurred by our factorizations, while a range of numerical experiments showcases the potential of our approach in reducing both the running time and the storage demand of the whole procedure. Moreover, experiments carried out in HPC environments illustrate the good scaling of our mode-parallel approach.
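To make the mode-parallel idea concrete, the following is a minimal NumPy sketch and not the algorithm of the paper. It assumes a dense tensor that fits in memory, uses a plain Gaussian randomized range finder on each mode unfolding (no fiber sampling), and uses a thread pool as a stand-in for the processes of an HPC environment. All function names (`unfold`, `range_finder`, `mode_parallel_tucker`, `reconstruct`) and the `oversample` parameter are hypothetical and introduced only for illustration. The key point it shows is that each factor matrix is computed from the original tensor alone, so the modes can be treated as independent tasks.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor


def unfold(X, mode):
    """Mode-`mode` unfolding: rows are indexed by the chosen mode."""
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)


def range_finder(A, rank, oversample=5, seed=0):
    """Orthonormal basis (rank columns) approximating the column space of A."""
    rng = np.random.default_rng(seed)
    k = min(rank + oversample, A.shape[0])
    # Gaussian sketch of the column space, then orthonormalize.
    Y = A @ rng.standard_normal((A.shape[1], k))
    Q, _ = np.linalg.qr(Y)
    # Truncate back to `rank` columns via an SVD of the small projected matrix.
    Ub, _, _ = np.linalg.svd(Q.T @ A, full_matrices=False)
    return Q @ Ub[:, :rank]


def mode_parallel_tucker(X, ranks):
    """Compute all factor matrices independently, then form the core."""
    def factor(mode):
        # Depends only on X, never on the factors of the other modes.
        return range_finder(unfold(X, mode), ranks[mode], seed=mode)

    with ThreadPoolExecutor() as pool:
        factors = list(pool.map(factor, range(X.ndim)))

    # Core tensor: contract X with the transpose of each factor.
    core = X
    for mode, U in enumerate(factors):
        core = np.moveaxis(np.tensordot(U.T, core, axes=(1, mode)), 0, mode)
    return core, factors


def reconstruct(core, factors):
    """Multiply the core by each factor matrix along its mode."""
    X = core
    for mode, U in enumerate(factors):
        X = np.moveaxis(np.tensordot(U, X, axes=(1, mode)), 0, mode)
    return X
```

In a sequentially truncated scheme, by contrast, the factor for one mode is computed from a tensor already compressed along the previous modes, which forces the modes to be visited one after another. The sketch above gives up that compression between modes in exchange for independence across them.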