LG · DIS-NN · ST · ML · Aug 11, 2024

Kernel Density Estimators in Large Dimensions

arXiv:2408.05807v3 · 7 citations · h-index: 5
Originality: Incremental advance
AI Analysis

This work addresses kernel density estimation for high-dimensional data and gives theoretical insight into the statistical transitions that occur as the bandwidth decreases. It is rated incremental because it extends the traditional fixed-dimension analysis to a new regime.

The paper analyzes kernel density estimation in the regime where the number of data points $n$ and the dimensionality $d$ grow together with the ratio $α=(\log n)/d$ held fixed. It identifies three distinct statistical regimes depending on the bandwidth: a classical regime where the Central Limit Theorem holds, a regime where the Central Limit Theorem breaks down and the estimator follows a heavy-tailed distribution, and a regime governed by extreme value statistics. It shows that the optimal bandwidth threshold based on Kullback-Leibler divergence lies in the newly identified regime, which is relevant for high-dimensional applications.

This paper studies Kernel Density Estimation for a high-dimensional distribution $ρ(x)$. Traditional approaches have focused on the limit of a large number of data points $n$ and fixed dimension $d$. We analyze instead the regime where both the number $n$ of data points $y_i$ and their dimensionality $d$ grow with a fixed ratio $α=(\log n)/d$. Our study reveals three distinct statistical regimes for the kernel-based estimate of the density $\hat ρ_h^{\mathcal {D}}(x)=\frac{1}{n h^d}\sum_{i=1}^n K\left(\frac{x-y_i}{h}\right)$, depending on the bandwidth $h$. For large bandwidth there is a classical regime where the Central Limit Theorem (CLT) holds, akin to the one found in traditional approaches. Below a certain value of the bandwidth, $h_{CLT}(α)$, we find that the CLT breaks down: the statistics of $\hat ρ_h^{\mathcal {D}}(x)$ for a fixed $x$ drawn from $ρ(x)$ are given by a heavy-tailed distribution (an alpha-stable distribution). In particular, below a value $h_G(α)$, we find that $\hat ρ_h^{\mathcal {D}}(x)$ is governed by extreme value statistics: only a few points in the database matter and give the dominant contribution to the density estimator. We provide a detailed analysis for high-dimensional multivariate Gaussian data. We show that the optimal bandwidth threshold based on Kullback-Leibler divergence lies in the new statistical regime identified in this paper. As known by practitioners, when decreasing the bandwidth a kernel-estimated density changes from a smooth curve to a collection of peaks centred on the data points. Our findings reveal that this general phenomenon is related to sharp transitions between phases characterized by different statistical properties, and offer new insights for Kernel Density Estimation in high-dimensional settings.
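
To make the estimator $\hat ρ_h^{\mathcal {D}}(x)$ concrete, the following is a minimal numerical sketch, not the authors' code. It assumes a normalised Gaussian kernel $K(u)=(2π)^{-d/2}e^{-|u|^2/2}$ and standard Gaussian data; the function names `log_kde` and `top_weight` and the values of `d`, `alpha`, and `h` are illustrative choices that do not come from the paper. The sum is evaluated in log-space because individual kernel terms underflow in large dimension, and the second function reports the share of the sum carried by the single nearest data point, a rough proxy for the "only a few points matter" behaviour described above.

```python
import numpy as np

def log_kde(x, data, h):
    """Log of the Gaussian-kernel density estimate at x (log-sum-exp for stability)."""
    n, d = data.shape
    sq = np.sum((data - x) ** 2, axis=1)      # squared distances |x - y_i|^2
    log_terms = -sq / (2.0 * h ** 2)          # log of unnormalised kernel values
    # log of the prefactor 1 / (n h^d (2 pi)^(d/2))
    log_norm = -np.log(n) - d * np.log(h) - 0.5 * d * np.log(2.0 * np.pi)
    m = log_terms.max()
    return log_norm + m + np.log(np.sum(np.exp(log_terms - m)))

def top_weight(x, data, h):
    """Fraction of the kernel sum contributed by the nearest data point."""
    sq = np.sum((data - x) ** 2, axis=1)
    log_terms = -sq / (2.0 * h ** 2)
    w = np.exp(log_terms - log_terms.max())   # largest term is rescaled to 1
    return 1.0 / w.sum()

# Illustrative setup (assumed values): alpha = (log n) / d = 0.5 with d = 20.
rng = np.random.default_rng(0)
d, alpha = 20, 0.5
n = int(np.exp(alpha * d))
data = rng.standard_normal((n, d))            # database of points y_i
x = rng.standard_normal(d)                    # test point drawn from the same density

for h in (2.0, 1.0, 0.5, 0.2):
    print(f"h={h:4.1f}  log rho_hat={log_kde(x, data, h):9.3f}  "
          f"top weight={top_weight(x, data, h):.3f}")
```

For a fixed database, the nearest-point share can only grow as $h$ decreases, since every other term shrinks relative to the largest one. The sketch does not locate $h_{CLT}(α)$ or $h_G(α)$; it only illustrates the qualitative drift from many contributing points to a few.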
