ML LG ST Dec 13, 2024

A Statistical Analysis for Supervised Deep Learning with Exponential Families for Intrinsically Low-dimensional Data

Saptarshi Chakraborty, Peter L. Bartlett

arXiv:2412.09779v15.51 citations h-index: 2

Originality: Incremental advance

AI Analysis

This work gives researchers in machine learning theory a sharper analysis of convergence rates in deep supervised learning. It targets a known bottleneck, the dependence of the rate on the ambient input dimension, and makes an incremental improvement to existing rate analyses.

The paper studies test-error convergence rates in supervised deep learning for intrinsically low-dimensional data. It shows that the error scales as $\tilde{\mathcal{O}}\left(n^{-\frac{2β}{2β+ \bar{d}_{2β}(λ)}}\right)$, where $\bar{d}_{2β}(λ)$ is the $2β$-entropic dimension of $λ$, the distribution of the explanatory variables, improving on the best-known rates. When the explanatory variables have an upper-bounded density, the dependence on the input dimension $d$ is at most polynomial rather than exponential.

Recent advances have revealed that the rate of convergence of the expected test error in deep supervised learning decays as a function of the intrinsic dimension and not the dimension $d$ of the input space. Existing literature defines this intrinsic dimension as the Minkowski dimension or the manifold dimension of the support of the underlying probability measures, which often results in sub-optimal rates and unrealistic assumptions. In this paper, we consider supervised deep learning when the response given the explanatory variable is distributed according to an exponential family with a $β$-Hölder smooth mean function. We consider an entropic notion of the intrinsic data-dimension and demonstrate that with $n$ independent and identically distributed samples, the test error scales as $\tilde{\mathcal{O}}\left(n^{-\frac{2β}{2β+ \bar{d}_{2β}(λ)}}\right)$, where $\bar{d}_{2β}(λ)$ is the $2β$-entropic dimension of $λ$, the distribution of the explanatory variables. This improves on the best-known rates. Furthermore, under the assumption of an upper-bounded density of the explanatory variables, we characterize the rate of convergence as $\tilde{\mathcal{O}}\left( d^{\frac{2\lfloorβ\rfloor(β+ d)}{2β+ d}}n^{-\frac{2β}{2β+ d}}\right)$, establishing that the dependence on $d$ is not exponential but at most polynomial. We also demonstrate that when the explanatory variable has a lower-bounded density, this rate, in terms of the number of data samples, is nearly optimal for learning the dependence structure for exponential families.
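To get a feel for the two rates quoted in the abstract, the sketch below evaluates their exponents numerically. It is only an illustration. The function names are made up for this example, the values β = 2, d = 784 and an entropic dimension of 10 are hypothetical and not taken from the paper, and all constants and logarithmic factors hidden by the $\tilde{\mathcal{O}}$ notation are ignored.

```python
import math


def rate_exponent(beta: float, dim: float) -> float:
    """Exponent a in the rate n^{-a}, with a = 2*beta / (2*beta + dim).

    `dim` may be the ambient dimension d or an intrinsic (entropic) dimension.
    """
    if beta <= 0 or dim < 0:
        raise ValueError("beta must be positive and dim non-negative")
    return 2.0 * beta / (2.0 * beta + dim)


def prefactor_exponent(beta: float, d: float) -> float:
    """Power of d in the prefactor d^{2*floor(beta)*(beta + d)/(2*beta + d)}."""
    return 2.0 * math.floor(beta) * (beta + d) / (2.0 * beta + d)


def samples_for_error(eps: float, beta: float, dim: float) -> float:
    """Solve n^{-a} = eps for n, ignoring constants and log factors."""
    return eps ** (-1.0 / rate_exponent(beta, dim))


if __name__ == "__main__":
    beta = 2.0          # hypothetical Hoelder smoothness
    ambient_d = 784     # hypothetical input dimension
    intrinsic_d = 10.0  # hypothetical entropic dimension

    print("exponent with ambient d:   ", rate_exponent(beta, ambient_d))
    print("exponent with intrinsic d: ", rate_exponent(beta, intrinsic_d))
    print("power of d in prefactor:   ", prefactor_exponent(beta, ambient_d))
    print("n for error 0.1, intrinsic:", samples_for_error(0.1, beta, intrinsic_d))
```

Because $(β + d)/(2β + d) < 1$, the power of $d$ in the prefactor never exceeds $2\lfloorβ\rfloor$, which is one way to read the claim that the dependence on $d$ is at most polynomial. The sample-size helper is a rough order-of-magnitude guide only, since the actual bounds carry constants and log terms that this sketch drops.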
