Accelerating Barnes-Hut t-SNE Algorithm by Efficient Parallelization on Multi-Core CPUs
This work provides a faster t-SNE tool for data scientists and researchers dealing with large datasets, though it is incremental as it optimizes an existing algorithm.
The paper tackles the inefficiency of existing CPU implementations of the Barnes-Hut t-SNE algorithm for visualizing high-dimensional data, achieving speedups of up to 261x over scikit-learn and 4x over a state-of-the-art implementation on a 32-core system.
t-SNE remains one of the most popular embedding techniques for visualizing high-dimensional data. Most standard packages of t-SNE, such as scikit-learn, use the Barnes-Hut t-SNE (BH t-SNE) algorithm for large datasets. However, existing CPU implementations of this algorithm are inefficient. In this work, we accelerate the BH t-SNE on CPUs via cache optimizations, SIMD, parallelizing sequential steps, and improving parallelization of multithreaded steps. Our implementation (Acc-t-SNE) is up to 261x and 4x faster than scikit-learn and the state-of-the-art BH t-SNE implementation from daal4py, respectively, on a 32-core Intel(R) Icelake cloud instance.