Manifold Hypothesis in Data Analysis: Double Geometrically-Probabilistic Approach to Manifold Dimension Estimation
This work addresses manifold dimension estimation, a problem relevant to researchers in unsupervised and semi-supervised learning. The contribution appears partly incremental, since its geometric component builds on the existing box-counting method.
The paper addresses two related problems: verifying the manifold hypothesis and estimating the dimension of the underlying manifold. It proposes an approach that combines a geometric method (a modified box-counting algorithm) with a new probabilistic method, and reports that the combination is powerful and effective on real datasets.
The manifold hypothesis states that data points in a high-dimensional space lie close to a manifold of much lower dimension. In many cases this hypothesis has been verified empirically and used to improve unsupervised and semi-supervised learning. Here we present a new approach to checking the manifold hypothesis and estimating the dimension of the underlying manifold. To do so, we apply two very different methods simultaneously, one geometric and the other probabilistic, and check whether they give the same result. Our geometric method is a modification, for sparse data, of the well-known box-counting algorithm for computing the Minkowski dimension. The probabilistic method is new. Although it relies on standard nearest-neighbor distances, it differs from the methods previously used for this purpose. It is robust and fast, and it includes a special preliminary data transformation. Experiments on real datasets show that the suggested approach, based on combining the two methods, is powerful and effective.
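For readers unfamiliar with box counting, the following is a minimal sketch of the standard (unmodified) algorithm that the geometric method starts from. It is not the sparse-data modification proposed here, whose details the abstract does not give. The function name `box_counting_dimension`, the choice of scales, and the global rescaling step are assumptions made for illustration. The idea is to cover the data with boxes of side eps, count the occupied boxes N(eps), and take the slope of log N(eps) against log(1/eps) as the dimension estimate.

```python
import numpy as np


def box_counting_dimension(points, scales):
    """Estimate the box-counting (Minkowski) dimension of a point cloud.

    Illustrative sketch of the standard algorithm only; the name, the
    scales and the rescaling are assumptions, not the paper's method.
    """
    points = np.asarray(points, dtype=float)
    scales = np.asarray(scales, dtype=float)

    # Shift to the origin and rescale by the largest coordinate range so
    # that the cloud fits in the unit cube without distorting its shape.
    mins = points.min(axis=0)
    span = (points.max(axis=0) - mins).max()
    if span == 0:
        span = 1.0  # degenerate cloud: all points coincide
    unit = np.clip((points - mins) / span, 0.0, 1.0 - 1e-12)

    counts = []
    for eps in scales:
        # Integer index of the box of side eps that contains each point.
        cells = np.floor(unit / eps).astype(np.int64)
        # Number of distinct occupied boxes, N(eps).
        counts.append(len(np.unique(cells, axis=0)))

    # Slope of log N(eps) versus log(1 / eps) is the dimension estimate.
    slope, _intercept = np.polyfit(np.log(1.0 / scales), np.log(counts), 1)
    return slope


# Usage: a 2-dimensional square embedded in 5-dimensional space.
rng = np.random.default_rng(0)
plane = np.zeros((20000, 5))
plane[:, :2] = rng.random((20000, 2))
estimate = box_counting_dimension(plane, scales=[1 / 4, 1 / 8, 1 / 16, 1 / 32])
print(round(estimate, 2))  # expected to be close to 2
```

With dense samples such as this one, every box is occupied at each scale and the slope is stable. With sparse data, most boxes are empty at small eps and the count saturates at the number of points, which is presumably the regime the modification is meant to handle.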