LG | QM | ML | Jan 18, 2019

Estimating the effective dimension of large biological datasets using Fisher separability analysis

arXiv:1901.06328v1 | 48 citations | Has Code
Originality: Incremental advance
AI Analysis

The paper gives researchers a tool for analyzing high-dimensional biological data, though the advance is incremental because it builds on existing dimensionality estimation techniques.

The paper tackles the problem of estimating intrinsic dimensionality in large biological datasets. It shows that a Fisher separability-based estimator performs competitively with state-of-the-art methods, particularly on noisy samples, and that it does not require a manifold assumption.

Modern large-scale datasets are frequently said to be high-dimensional. However, their data point clouds frequently possess structures, significantly decreasing their intrinsic dimensionality (ID) due to the presence of clusters, points being located close to low-dimensional varieties or fine-grained lumping. We test a recently introduced dimensionality estimator, based on analysing the separability properties of data points, on several benchmarks and real biological datasets. We show that the introduced measure of ID has performance competitive with state-of-the-art measures, being efficient across a wide range of dimensions and performing better in the case of noisy samples. Moreover, it allows estimating the intrinsic dimension in situations where the intrinsic manifold assumption is not valid.
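
The abstract only names the separability-based estimator, so the following is a minimal sketch of how such an estimate could be computed, not the code from the linked repository. It assumes the usual recipe for this kind of estimator: centre and whiten the data, project the points onto the unit sphere, measure the mean fraction of points from which a point is not Fisher-separable, and match that fraction to the value expected for a uniform distribution on a sphere of dimension n. The function name `fisher_separability_id`, the default `alpha=0.8`, and the variance cut-off `rel_tol` are illustrative assumptions, as is the closed-form reference probability quoted in the comments.

```python
import numpy as np
from scipy.special import lambertw


def fisher_separability_id(X, alpha=0.8, rel_tol=1e-8):
    """Illustrative intrinsic-dimension estimate from Fisher inseparability.

    alpha and rel_tol are assumptions of this sketch, not values from the paper.
    """
    X = np.asarray(X, dtype=float)
    n_points = X.shape[0]

    # Centre the data and whiten it: the left singular vectors of the centred
    # matrix are the principal-component scores rescaled to equal variance.
    Xc = X - X.mean(axis=0)
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    Z = U[:, s > rel_tol * s[0]]  # drop directions with negligible variance

    # Project every point onto the unit sphere.
    norms = np.linalg.norm(Z, axis=1, keepdims=True)
    Z = Z / np.maximum(norms, np.finfo(float).tiny)

    # On the unit sphere (x, x) = 1, so x is NOT Fisher-separable from y
    # when the inner product (x, y) exceeds alpha.
    G = Z @ Z.T
    np.fill_diagonal(G, 0.0)  # ignore each point's product with itself
    p_bar = (G > alpha).sum(axis=1).mean() / (n_points - 1)
    if p_bar == 0.0:
        return float("inf")  # every point is separable at this alpha

    # Assumed reference law for a uniform sphere in dimension n:
    #   p = (1 - a^2)^((n - 1) / 2) / (a * sqrt(2 * pi * n))
    # Solving for n gives a Lambert W expression.
    c = -np.log(1.0 - alpha**2)
    arg = c / (2.0 * np.pi * p_bar**2 * alpha**2 * (1.0 - alpha**2))
    return float(lambertw(arg).real / c)
```

Calling `fisher_separability_id(X)` on an array of shape `(n_points, n_features)` returns a real-valued dimension estimate. For an isotropic Gaussian sample, the result should land near the number of features. For data confined to a lower-dimensional linear subspace, it should land near the subspace dimension. Because the sketch builds the full Gram matrix, memory grows quadratically with the number of points, so very large datasets would need subsampling or blockwise computation.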

Code Implementations: 1 repo
