ML · LG · ST · Dec 13, 2024

A Statistical Analysis for Supervised Deep Learning with Exponential Families for Intrinsically Low-dimensional Data

arXiv:2412.09779v1 · 1 citation · h-index: 2
Originality: Incremental advance
AI Analysis

This work gives researchers in machine learning theory sharper convergence-rate guarantees for supervised deep learning. It is an incremental advance on earlier rate analyses that measured intrinsic dimension through the Minkowski or manifold dimension of the data support.

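The paper studies how fast the expected test error converges in supervised deep learning when the data are intrinsically low-dimensional. It shows that the error scales as $\tilde{\mathcal{O}}\left(n^{-\frac{2β}{2β+ \bar{d}_{2β}(λ)}}\right)$, where $\bar{d}_{2β}(λ)$ is the $2β$-entropic dimension of $λ$, the distribution of the explanatory variables, which improves on the best-known rates. Under the additional assumption of an upper-bounded density of the explanatory variables, it also shows that the dependence on the input dimension $d$ is at most polynomial rather than exponential.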

Recent advances have revealed that the rate of convergence of the expected test error in deep supervised learning decays as a function of the intrinsic dimension and not the dimension $d$ of the input space. Existing literature defines this intrinsic dimension as the Minkowski dimension or the manifold dimension of the support of the underlying probability measures, which often results in sub-optimal rates and unrealistic assumptions. In this paper, we consider supervised deep learning when the response given the explanatory variable is distributed according to an exponential family with a $β$-Hölder smooth mean function. We consider an entropic notion of the intrinsic data-dimension and demonstrate that with $n$ independent and identically distributed samples, the test error scales as $\tilde{\mathcal{O}}\left(n^{-\frac{2β}{2β+ \bar{d}_{2β}(λ)}}\right)$, where $\bar{d}_{2β}(λ)$ is the $2β$-entropic dimension of $λ$, the distribution of the explanatory variables. This improves on the best-known rates. Furthermore, under the assumption of an upper-bounded density of the explanatory variables, we characterize the rate of convergence as $\tilde{\mathcal{O}}\left( d^{\frac{2\lfloorβ\rfloor(β+ d)}{2β+ d}}n^{-\frac{2β}{2β+ d}}\right)$, establishing that the dependence on $d$ is not exponential but at most polynomial. We also demonstrate that when the explanatory variable has a lower-bounded density, this rate, in terms of the number of data samples, is nearly optimal for learning the dependence structure for exponential families.
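
To give a feel for what the two rates in the abstract mean numerically, here is a minimal Python sketch that evaluates the exponent $\frac{2β}{2β+ \bar{d}}$ and the polynomial prefactor $d^{\frac{2\lfloorβ\rfloor(β+ d)}{2β+ d}}$. The function names and the example values (smoothness 2, ambient dimension 784, intrinsic dimension 10) are illustrative assumptions, not taken from the paper, and the sketch ignores the constants and logarithmic factors hidden by $\tilde{\mathcal{O}}$.

```python
import math


def rate_exponent(beta: float, dim: float) -> float:
    """Return a = 2*beta / (2*beta + dim), so the rate reads n^{-a}.

    `dim` can be the entropic dimension or the ambient dimension d;
    both values used below are hypothetical.
    """
    if beta <= 0 or dim < 0:
        raise ValueError("beta must be positive and dim non-negative")
    return 2.0 * beta / (2.0 * beta + dim)


def dimension_prefactor(beta: float, d: float) -> float:
    """Return d^{2*floor(beta)*(beta + d) / (2*beta + d)}.

    This is the polynomial-in-d factor from the upper-bounded-density rate.
    """
    power = 2.0 * math.floor(beta) * (beta + d) / (2.0 * beta + d)
    return d ** power


def rate_bound(n: int, beta: float, dim: float) -> float:
    """Return n^{-a}, omitting constants and log factors."""
    return n ** (-rate_exponent(beta, dim))


if __name__ == "__main__":
    beta = 2.0          # assumed Hölder smoothness
    ambient_d = 784     # assumed input dimension
    intrinsic_d = 10    # assumed entropic dimension

    print("exponent with intrinsic dimension:", rate_exponent(beta, intrinsic_d))
    print("exponent with ambient dimension:  ", rate_exponent(beta, ambient_d))
    print("prefactor in d:", dimension_prefactor(beta, ambient_d))
```

With these assumed values the exponent is 4/14 when the intrinsic dimension is used and 4/788 when the ambient dimension is used, which is why a rate stated in terms of a small intrinsic dimension decays much faster in $n$.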
