LG · May 25
Stochastic Estimation of the Layer-wise Hessian Trace for Monitoring Neural-network Training
Maxim Bolshim, Alexander Kugaevskikh
The loss and the norm of its gradient separate the healthy and the pathological regimes of neural-network training only weakly, whilst the curvature of the empirical risk differs qualitatively between them but cannot be formed explicitly at parameter counts $P\sim 10^{6}$--$10^{8}$. We present a stochastic estimator of the trace of the diagonal blocks of the Hessian matrix of the empirical risk of a neural network. The procedure combines the Hutchinson stochastic trace estimator with a single Hessian-vector product over the whole parameter vector and recovers unbiased estimates of every per-layer trace in one backward pass through the computational graph. We show that correctness under weight sharing requires the layer-wise Hessian to be assembled before the second differentiation: unrolling shared weights into independent coordinates introduces a systematic bias whose sign and magnitude are governed by the cross-instance blocks of the unrolled Hessian. A closed-form expression for the variance of the estimator at a fixed Hessian is derived, together with a decomposition of the total variance under the mini-batch sampling distribution. This decomposition yields a critical probe count $K^{\star}$ that balances the two sources of randomness and supports the practical recommendation $K\in[5,10]$ in the on-line monitoring regime. The estimator is applied to the detection of the label-memorisation regime of ResNet-18, ResNet-34, and VGG-11 on CIFAR-10 and CIFAR-100, where a calibrated cumulative-sum decision rule attains an empirical detection power of $179/180$ at a false-alarm rate of $16/120$.
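The layer-wise trace estimate described in this abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: an explicit symmetric matrix `H` stands in for the Hessian, the product `H @ v` stands in for an autodiff Hessian-vector product, and the names `layerwise_hutchinson`, `layer_slices` and the two-layer split are assumptions made for illustration. With Rademacher probes $v$, the quantity $v_\ell^{\top}(Hv)_\ell$ has expectation $\operatorname{tr} H_{\ell\ell}$, so one product per probe serves every layer at once.

```python
import numpy as np

def layerwise_hutchinson(hvp, layer_slices, probes):
    """Average v_l . (Hv)_l over the probes, for every layer block l."""
    estimates = np.zeros(len(layer_slices))
    for v in probes:
        hv = hvp(v)  # one Hessian-vector product over the whole parameter vector
        for idx, sl in enumerate(layer_slices):
            # Restrict both v and Hv to the coordinates of layer idx.
            estimates[idx] += v[sl] @ hv[sl]
    return estimates / len(probes)

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6))
H = (A + A.T) / 2                          # toy symmetric "Hessian" (assumption)
layer_slices = [slice(0, 2), slice(2, 6)]  # two hypothetical layers
probes = rng.choice([-1.0, 1.0], size=(8, 6))  # K = 8 Rademacher probes

estimates = layerwise_hutchinson(lambda v: H @ v, layer_slices, probes)
exact = np.array([np.trace(H[sl, sl]) for sl in layer_slices])
print("estimated per-layer traces:", estimates)
print("exact per-layer traces:    ", exact)
```

In a real network the lambda would be replaced by a Hessian-vector product from an autodiff framework; the off-diagonal entries of `H` contribute only zero-mean noise to each estimate, which is what the probe count $K$ averages out.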
LG · Apr 13
Inter-Layer Hessian Analysis of Neural Networks with DAG Architectures
Maxim Bolshim, Alexander Kugaevskikh
Modern automatic differentiation frameworks (JAX, PyTorch) return the Hessian of the loss function as a monolithic tensor, without exposing the internal structure of inter-layer interactions. This paper presents an analytical formalism that explicitly decomposes the full Hessian into blocks indexed by the DAG of an arbitrary architecture. The canonical decomposition $H = H^{GN} + H^T$ separates the Gauss--Newton component (convex part) from the tensor component (residual curvature responsible for saddle points). For piecewise-linear activations (ReLU), the tensor component of the input Hessian vanishes ($H^{T}_{v,w}\!\equiv\!0$ a.e., $H^f_{v,w}\!=\!H^{GN}_{v,w}\!\succeq\!0$); the full parametric Hessian contains residual terms that do not reduce to the GGN. Building on this decomposition, we introduce diagnostic metrics (inter-layer resonance~$\mathcal{R}$, geometric coupling~$\mathcal{C}$, stable rank~$\mathcal{D}$, GN-Gap) that are estimated stochastically in $O(P)$ time and reveal structural curvature interactions between layers. The theoretical analysis explains exponential decay of resonance in vanilla networks and its preservation under skip connections; empirical validation spans fully connected MLPs (Exp.\,1--5) and convolutional architectures (ResNet-18, ${\sim}11$M~parameters, Exp.\,6). When the architecture reduces to a single node, all definitions collapse to the standard Hessian $\nabla^2_{\theta}\mathcal{L}(\theta)\in\mathbb{R}^{p\times p}$.
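The split $H = H^{GN} + H^T$ can be made concrete on a toy problem. The sketch below is an illustration under stated assumptions, not the paper's DAG formalism: it uses the textbook Gauss-Newton split for a squared loss $\tfrac12 (f-y)^2$ on a hypothetical two-parameter model $f = a\,b\,x$ (a scalar two-layer linear network), where the residual term is the loss residual times the second derivative of $f$. The function name `hessian_split` is invented for the example.

```python
import numpy as np

def hessian_split(theta, x, y):
    """Return (H_gn, H_t) for the loss 0.5 * (a*b*x - y)**2."""
    a, b = theta
    f = a * b * x                 # toy two-layer linear model (assumption)
    r = f - y                     # residual of the squared loss
    J = np.array([b * x, a * x])  # gradient of f with respect to (a, b)
    H_gn = np.outer(J, J)         # Gauss-Newton part: positive semidefinite
    # Residual part: r times the Hessian of f, which is [[0, x], [x, 0]].
    H_t = r * np.array([[0.0, x], [x, 0.0]])
    return H_gn, H_t

H_gn, H_t = hessian_split(np.array([0.5, -1.5]), x=2.0, y=1.0)
H = H_gn + H_t
print("eigenvalues of H_gn:", np.linalg.eigvalsh(H_gn))
print("eigenvalues of H_t: ", np.linalg.eigvalsh(H_t))
print("eigenvalues of H:   ", np.linalg.eigvalsh(H))
```

In this toy case the Gauss-Newton part has no negative eigenvalues, while the residual part has eigenvalues $\pm |r\,x|$, so any negative curvature of the full Hessian comes from the residual term.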
LG · Oct 30, 2025
Wasserstein Regression as a Variational Approximation of Probabilistic Trajectories through the Bernstein Basis
Maksim Maslov, Alexander Kugaevskikh, Matthew Ivanov
This paper considers the problem of regression over distributions, which is becoming increasingly important in machine learning. Existing approaches often ignore the geometry of the probability space or are computationally expensive. To overcome these limitations, a new method is proposed that combines the parameterization of probability trajectories using a Bernstein basis and the minimization of the Wasserstein distance between distributions. The key idea is to model a conditional distribution as a smooth probability trajectory defined by a weighted sum of Gaussian components whose parameters -- the mean and covariance -- are functions of the input variable constructed using Bernstein polynomials. The loss function is the averaged squared Wasserstein distance between the predicted Gaussian distributions and the empirical data, which takes into account the geometry of the distributions. An autodiff-based optimization method is used to train the model. Experiments on synthetic datasets that include complex trajectories demonstrated that the proposed method provides competitive approximation quality in terms of the Wasserstein distance, Energy Distance, and RMSE metrics, especially in cases of pronounced nonlinearity. The model demonstrates trajectory smoothness that is better than or comparable to alternatives and robustness to changes in data structure, while maintaining high interpretability due to explicit parameterization via control points. The developed approach represents a balanced solution that combines geometric accuracy, computational practicality, and interpretability. Prospects for further research include extending the method to non-Gaussian distributions, applying entropy regularization to speed up computations, and adapting the approach to working with high-dimensional data for approximating surfaces and more complex structures.
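A simplified sketch of the two ingredients named in this abstract, a Bernstein-basis parameterization and a Wasserstein loss, is given below. It makes several assumptions not taken from the paper: a single one-dimensional Gaussian component instead of a weighted mixture, an input already scaled to $[0, 1]$, hypothetical control points, and a fixed reference Gaussian in place of empirical data. For one-dimensional Gaussians the squared 2-Wasserstein distance has the closed form $(m_1 - m_2)^2 + (s_1 - s_2)^2$.

```python
import numpy as np
from math import comb

def bernstein_basis(t, degree):
    """Bernstein polynomials B_{k,degree}(t), k = 0..degree, for t in [0, 1]."""
    t = np.asarray(t, dtype=float)
    return np.stack(
        [comb(degree, k) * t**k * (1 - t) ** (degree - k) for k in range(degree + 1)],
        axis=-1,
    )

def w2_squared_gaussian_1d(m1, s1, m2, s2):
    """Squared 2-Wasserstein distance between N(m1, s1^2) and N(m2, s2^2)."""
    return (m1 - m2) ** 2 + (s1 - s2) ** 2

# Hypothetical control points; these would be the trainable parameters.
mean_ctrl = np.array([0.0, 2.0, -1.0, 1.0])
std_ctrl = np.array([0.5, 1.0, 0.3, 0.8])  # positive, so the curve stays positive

t = np.linspace(0.0, 1.0, 5)
B = bernstein_basis(t, degree=3)  # shape (5, 4)
mean_t = B @ mean_ctrl            # smooth trajectory of the mean
std_t = B @ std_ctrl              # smooth trajectory of the standard deviation

# Averaged squared W2 distance to a fixed reference N(0, 1), as a stand-in loss.
loss = np.mean(w2_squared_gaussian_1d(mean_t, std_t, 0.0, 1.0))
print("mean trajectory:", mean_t)
print("loss:", loss)
```

Because the Bernstein basis is non-negative and sums to one, each trajectory value is a convex combination of the control points, which is what makes the control points directly interpretable.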
LG · Oct 20, 2025
Local properties of neural networks through the lens of layer-wise Hessians
Maxim Bolshim, Alexander Kugaevskikh
We introduce a methodology for analyzing neural networks through the lens of layer-wise Hessian matrices. The local Hessian of each functional block (layer) is defined as the matrix of second derivatives of a scalar function with respect to the parameters of that layer. This concept provides a formal tool for characterizing the local geometry of the parameter space. We show that the spectral properties of local Hessians, such as the distribution of eigenvalues, reveal quantitative patterns associated with overfitting, underparameterization, and expressivity in neural network architectures. We conduct an extensive empirical study involving 111 experiments across 37 datasets. The results demonstrate consistent structural regularities in the evolution of local Hessians during training and highlight correlations between their spectra and generalization performance. These findings establish a foundation for using local geometric analysis to guide the diagnosis and design of deep neural networks. The proposed framework connects optimization geometry with functional behavior and offers practical insight for improving network architectures and training stability.
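The notion of a local (layer-wise) Hessian and its spectrum can be illustrated on a very small model. The sketch below is an assumption-laden example rather than the authors' procedure: it uses a hypothetical two-layer tanh network on random data, holds the first layer fixed, and approximates the Hessian of the mean-squared-error loss with respect to the second layer by central finite differences. The helper name `local_hessian` is invented here.

```python
import numpy as np

def local_hessian(f, w, eps=1e-4):
    """Central finite-difference Hessian of the scalar function f at vector w."""
    n = w.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n)
            ej = np.zeros(n)
            ei[i] = eps
            ej[j] = eps
            H[i, j] = (
                f(w + ei + ej) - f(w + ei - ej) - f(w - ei + ej) + f(w - ei - ej)
            ) / (4 * eps**2)
    return (H + H.T) / 2  # symmetrize away rounding noise

rng = np.random.default_rng(0)
X = rng.standard_normal((32, 3))  # toy inputs (assumption)
y = rng.standard_normal(32)       # toy targets (assumption)
W1 = rng.standard_normal((3, 4))  # first layer, held fixed
w2 = rng.standard_normal(4)       # second layer, the block under study

def loss_wrt_layer2(w):
    hidden = np.tanh(X @ W1)
    return 0.5 * np.mean((hidden @ w - y) ** 2)

H2 = local_hessian(loss_wrt_layer2, w2)
eigenvalues = np.linalg.eigvalsh(H2)  # spectrum of the local Hessian
print("local Hessian eigenvalues:", eigenvalues)
```

Finite differences scale quadratically with the block size and are used here only for transparency; for realistic layers one would rely on autodiff Hessian-vector products instead.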