LG QM Dec 24, 2018

bigMap: Big Data Mapping with Parallelized t-SNE

arXiv:1812.09869v2
Originality: Synthesis-oriented
AI Analysis

This work addresses the need for efficient clustering of large datasets. Its contribution is incremental: it builds on existing methods such as t-SNE and watershed segmentation.

The authors tackle unsupervised clustering of large-scale structured data with a three-step protocol: a parallelized t-SNE for dimensionality reduction, an adaptive kernel density estimation, and a watershed algorithm for segmentation. The protocol is implemented in the bigMap R package, which also provides tools for assessing the clustering.

We introduce an improved unsupervised clustering protocol specially suited for large-scale structured data. The protocol follows three steps: a dimensionality reduction of the data, a density estimation over the low-dimensional representation of the data, and a final segmentation of the density landscape. For the dimensionality reduction step we introduce a parallelized implementation of the well-known t-distributed Stochastic Neighbor Embedding (t-SNE) algorithm that significantly alleviates some of its inherent limitations while improving its suitability for large datasets. We also introduce a new adaptive Kernel Density Estimation, specifically coupled with the t-SNE framework, to obtain accurate density estimates from the embedded data, and a variant of the rainfalling watershed algorithm to identify clusters within the density landscape. The whole mapping protocol is wrapped in the bigMap R package, together with visualization and analysis tools to ease the qualitative and quantitative assessment of the clustering.
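The second and third steps of the protocol (density estimation over the embedding, then segmentation of the density landscape) can be illustrated with a minimal sketch. It is not the bigMap API and not the authors' implementation: it is written in plain Python for illustration, it assumes a 2-D embedding has already been computed, it swaps the paper's adaptive kernel for a fixed-bandwidth Gaussian kernel, and it uses a simple steepest-ascent labelling in the spirit of a rainfalling watershed. All function names and parameters (`density_grid`, `watershed_labels`, `cluster_embedding`, `grid_size`, `bandwidth`, `pad`) are hypothetical.

```python
import math


def density_grid(points, grid_size, bandwidth, pad):
    """Evaluate a fixed-bandwidth Gaussian KDE on a regular grid.

    Simplification: the paper describes an adaptive kernel; a single
    bandwidth is assumed here to keep the sketch short.
    """
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x0, x1 = min(xs) - pad, max(xs) + pad
    y0, y1 = min(ys) - pad, max(ys) + pad
    step_x = (x1 - x0) / (grid_size - 1)
    step_y = (y1 - y0) / (grid_size - 1)
    grid = []
    for i in range(grid_size):
        gy = y0 + i * step_y
        row = []
        for j in range(grid_size):
            gx = x0 + j * step_x
            row.append(sum(
                math.exp(-((gx - px) ** 2 + (gy - py) ** 2) / (2 * bandwidth ** 2))
                for px, py in points))
        grid.append(row)
    return grid, (x0, y0, step_x, step_y)


def watershed_labels(grid):
    """Label every cell by the density peak reached via steepest ascent."""
    n_rows, n_cols = len(grid), len(grid[0])

    def uphill(r, c):
        # Return the highest of the 8 neighbours, or the cell itself at a peak.
        best = (r, c)
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                rr, cc = r + dr, c + dc
                if (0 <= rr < n_rows and 0 <= cc < n_cols
                        and grid[rr][cc] > grid[best[0]][best[1]]):
                    best = (rr, cc)
        return best

    peaks = {}  # peak cell -> cluster label
    labels = [[-1] * n_cols for _ in range(n_rows)]
    for r in range(n_rows):
        for c in range(n_cols):
            cur = (r, c)
            while True:
                nxt = uphill(*cur)
                if nxt == cur:  # local maximum reached
                    break
                cur = nxt
            labels[r][c] = peaks.setdefault(cur, len(peaks))
    return labels


def cluster_embedding(points, grid_size=30, bandwidth=0.5, pad=1.0):
    """Assign each embedded point the label of its nearest grid cell."""
    grid, (x0, y0, step_x, step_y) = density_grid(points, grid_size, bandwidth, pad)
    labels = watershed_labels(grid)
    return [labels[round((py - y0) / step_y)][round((px - x0) / step_x)]
            for px, py in points]
```

Called on a list of `(x, y)` tuples containing two well-separated groups, `cluster_embedding` should give the points of each group a shared label and the two groups different labels. The grid resolution and bandwidth control how finely the density landscape is split, which is the sensitivity an adaptive kernel is meant to reduce.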

