NAJun 12, 2013
Polyharmonic homogenization, rough polyharmonic splines and sparse super-localizationHouman Owhadi, Lei Zhang, Leonid Berlyand
We introduce a new variational method for the numerical homogenization of divergence form elliptic, parabolic and hyperbolic equations with arbitrary rough ($L^\infty$) coefficients. Our method does not rely on concepts of ergodicity or scale-separation but on compactness properties of the solution space and a new variational approach to homogenization. The approximation space is generated by an interpolation basis (over scattered points forming a mesh of resolution $H$) minimizing the $L^2$ norm of the source terms; its (pre-)computation involves minimizing $\mathcal{O}(H^{-d})$ quadratic (cell) problems on (super-)localized sub-domains of size $\mathcal{O}(H \ln (1/ H))$. The resulting localized linear systems remain sparse and banded. The resulting interpolation basis functions are biharmonic for $d\leq 3$, and polyharmonic for $d\geq 4$, for the operator $-\diiv(a\nabla \cdot)$ and can be seen as a generalization of polyharmonic splines to differential operators with arbitrary rough coefficients. The accuracy of the method ($\mathcal{O}(H)$ in energy norm and independent from aspect ratios of the mesh formed by the scattered points) is established via the introduction of a new class of higher-order Poincaré inequalities. The method bypasses (pre-)computations on the full domain and naturally generalizes to time dependent problems, it also provides a natural solution to the inverse problem of recovering the solution of a divergence form elliptic equation from a finite number of point measurements.
APJan 11, 2010
Flux norm approach to finite dimensional homogenization approximations with non-separated scales and high contrastLeonid Berlyand, Houman Owhadi
We consider divergence-form scalar elliptic equations and vectorial equations for elasticity with rough ($L^\infty(Ω)$, $Ω\subset \R^d$) coefficients $a(x)$ that, in particular, model media with non-separated scales and high contrast in material properties. We define the flux norm as the $L^2$ norm of the potential part of the fluxes of solutions, which is equivalent to the usual $H^1$-norm. We show that in the flux norm, the error associated with approximating, in a properly defined finite-dimensional space, the set of solutions of the aforementioned PDEs with rough coefficients is equal to the error associated with approximating the set of solutions of the same type of PDEs with smooth coefficients in a standard space (e.g., piecewise polynomial). We refer to this property as the {\it transfer property}. A simple application of this property is the construction of finite dimensional approximation spaces with errors independent of the regularity and contrast of the coefficients and with optimal and explicit convergence rates. This transfer property also provides an alternative to the global harmonic change of coordinates for the homogenization of elliptic operators that can be extended to elasticity equations. The proofs of these homogenization results are based on a new class of elliptic inequalities which play the same role in our approach as the div-curl lemma in classical homogenization.
LGMay 23
Pruning Deep Neural Networks via the Marchenko--Pastur DistributionLeonid Berlyand, Theo Bourdais, Houman Owhad et al.
We study a Marchenko--Pastur (MP) random-matrix approach to pruning deep neural networks with very small post-pruning fine-tuning budgets. The main practical contribution is accuracy retention under short calibration and fine-tuning schedules, rather than a long post-pruning reoptimization pipeline. The theory gives deterministic data-path certificates: if the removed component $R$ has small propagated logit effect $L_s \| R ψ_1(s) \|_\infty$, pruning decreases an elastic-net objective and preserves samples whose dense margin exceeds twice the perturbation. The zero-budget case gives perfect pruning; a prune--restore extension models weight restoration inside a fixed sparse-execution pattern; and an additive $L_2$-regularized model shows admissible random-like components vanish at the training limit, with persistent spikes stabilizing as the MP bulk collapses. Under iid-Gaussian sufficient conditions, the fitted MP edge $σ_+$ gives a high-probability layerwise budget signal. On ImageNet-1k, after only three distillation epochs, ViT-B/16 $2{:}4{+}$ToMe reaches $83.41\%$ top-1 ($-1.70$ pp from dense) at $59.81\%$ sparse-execution MAC reduction, with $1.388\times$ best-observed A40 native-$2{:}4$ backend speedup for the same checkpoint and ToMe graph; a separate no-ToMe A100 endpoint gives $2.705\times$. At structured sparsity, ViT-B/16 $6{:}12$ reaches $83.74\%$, ViT-L/16 $8{:}16$ dense+permutation reaches $85.33\%$ ($-0.51$ pp), and ConvNeXtV2-Base $12{:}16$ reaches $86.35\%$ ($-0.37$ pp). For CNNs, ResNet50 $8{:}16$ dense+permutation reaches $75.87\%$ ($-0.26$ pp), and ResNet152d CAST-conv+permutation reaches $81.33\%$ ($-1.53$ pp) at ${\sim}50\%$ MAC accounting with a $1.62\times$ A40 im2col$+2{:}4$ sparse-GEMM audit.
LGOct 4, 2023
Enhancing Accuracy in Deep Learning Using Random Matrix TheoryLeonid Berlyand, Etienne Sandier, Yitzchak Shmalo et al.
We explore the applications of random matrix theory (RMT) in the training of deep neural networks (DNNs), focusing on layer pruning that is reducing the number of DNN parameters (weights). Our numerical results show that this pruning leads to a drastic reduction of parameters while not reducing the accuracy of DNNs and CNNs. Moreover, pruning the fully connected DNNs actually increases the accuracy and decreases the variance for random initializations. Our numerics indicate that this enhancement in accuracy is due to the simplification of the loss landscape. We next provide rigorous mathematical underpinning of these numerical results by proving the RMT-based Pruning Theorem. Our results offer valuable insights into the practical application of RMT for the creation of more efficient and accurate deep-learning models.
MATH-PHJul 16, 2025
Asymptotic behavior of eigenvalues of large rank perturbations of large random matricesIevgenii Afanasiev, Leonid Berlyand, Mariia Kiyashko
The paper is concerned with deformed Wigner random matrices. These matrices are closely connected with Deep Neural Networks (DNNs): weight matrices of trained DNNs could be represented in the form $R + S$, where $R$ is random and $S$ is highly correlated. The spectrum of such matrices plays a key role in rigorous underpinning of the novel pruning technique based on Random Matrix Theory. Mathematics has been done only for finite-rank matrix $S$. However, in practice rank may grow. In this paper we develop asymptotic analysis for the case of growing rank.
LGMar 2, 2025
Pruning Deep Neural Networks via a Combination of the Marchenko-Pastur Distribution and RegularizationLeonid Berlyand, Theo Bourdais, Houman Owhadi et al.
Deep neural networks (DNNs) have brought significant advancements in various applications in recent years, such as image recognition, speech recognition, and natural language processing. In particular, Vision Transformers (ViTs) have emerged as a powerful class of models in the field of deep learning for image classification. In this work, we propose a novel Random Matrix Theory (RMT)-based method for pruning pre-trained DNNs, based on the sparsification of weights and singular vectors, and apply it to ViTs. RMT provides a robust framework to analyze the statistical properties of large matrices, which has been shown to be crucial for understanding and optimizing the performance of DNNs. We demonstrate that our RMT-based pruning can be used to reduce the number of parameters of ViT models (trained on ImageNet) by 30-50\% with less than 1\% loss in accuracy. To our knowledge, this represents the state-of-the-art in pruning for these ViT models. Furthermore, we provide a rigorous mathematical underpinning of the above numerical studies, namely we proved a theorem for fully connected DNNs, and other more general DNN structures, describing how the randomness in the weight matrices of a DNN decreases as the weights approach a local or global minimum (during training). We verify this theorem through numerical experiments on fully connected DNNs, providing empirical support for our theoretical findings. Moreover, we prove a theorem that describes how DNN loss decreases as we remove randomness in the weight layers, and show a monotone dependence of the decrease in loss with the amount of randomness that we remove. Our results also provide significant RMT-based insights into the role of regularization during training and pruning.
NAJun 4, 2021
A novel multi-scale loss function for classification problems in machine learningLeonid Berlyand, Robert Creese, Pierre-Emmanuel Jabin
We introduce two-scale loss functions for use in various gradient descent algorithms applied to classification problems via deep neural networks. This new method is generic in the sense that it can be applied to a wide range of machine learning architectures, from deep neural networks to support vector machines for example. These two-scale loss functions allow to focus the training onto objects in the training set which are not well classified. This leads to an increase in several measures of performance for appropriately-defined two-scale loss functions with respect to the more classical cross-entropy when tested on traditional deep neural networks on the MNIST, CIFAR10, and CIFAR100 data-sets.
APFeb 10, 2020
Stability for the Training of Deep Neural Networks and Other ClassifiersLeonid Berlyand, Pierre-Emmanuel Jabin, C. Alex Safsten
We examine the stability of loss-minimizing training processes that are used for deep neural networks (DNN) and other classifiers. While a classifier is optimized during training through a so-called loss function, the performance of classifiers is usually evaluated by some measure of accuracy, such as the overall accuracy which quantifies the proportion of objects that are well classified. This leads to the guiding question of stability: does decreasing loss through training always result in increased accuracy? We formalize the notion of stability, and provide examples of instability. Our main result consists of two novel conditions on the classifier which, if either is satisfied, ensure stability of training, that is we derive tight bounds on accuracy as loss decreases. We also derive a sufficient condition for stability on the training set alone, identifying flat portions of the data manifold as potential sources of instability. The latter condition is explicitly verifiable on the training dataset. Our results do not depend on the algorithm used for training, as long as loss decreases with training.