NA · Oct 21, 2023
Graph Neural Networks and Applied Linear Algebra
Nicholas S. Moore, Eric C. Cyr, Peter Ohm et al.
Sparse matrix computations are ubiquitous in scientific computing. With the recent interest in scientific machine learning, it is natural to ask how sparse matrix computations can leverage neural networks (NN). Unfortunately, multi-layer perceptron (MLP) neural networks are typically not natural for either graph or sparse matrix computations. The issue lies with the fact that MLPs require fixed-sized inputs while scientific applications generally generate sparse matrices with arbitrary dimensions and a wide range of nonzero patterns (or matrix graph vertex interconnections). While convolutional NNs could possibly address matrix graphs where all vertices have the same number of nearest neighbors, a more general approach is needed for arbitrary sparse matrices, e.g., those arising from discretized partial differential equations on unstructured meshes. Graph neural networks (GNNs) are one approach suitable to sparse matrices. GNNs define aggregation functions (e.g., summations) that operate on variable size input data to produce data of a fixed output size so that MLPs can be applied. The goal of this paper is to provide an introduction to GNNs for a numerical linear algebra audience. Concrete examples are provided to illustrate how many common linear algebra tasks can be accomplished using GNNs. We focus on iterative methods that employ computational kernels such as matrix-vector products, interpolation, relaxation methods, and strength-of-connection measures. Our GNN examples include cases where parameters are determined a priori as well as cases where parameters must be learned. The intent of this article is to help computational scientists understand how GNNs can be used to adapt machine learning concepts to computational tasks associated with sparse matrices. It is hoped that this understanding will stimulate data-driven extensions of classical sparse linear algebra tasks.
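To make the aggregation idea concrete, the sketch below writes a sparse matrix-vector product as one round of message passing with sum aggregation: each nonzero $a_{ij}$ is an edge carrying the message $a_{ij} x_j$ to vertex $i$. This is an illustrative reading of the abstract, not code from the paper; the function name `matvec_message_passing` and the edge-list layout are assumptions, and the matrix is assumed square.

```python
from collections import defaultdict


def matvec_message_passing(edges, x):
    """Compute y = A x by summing per-edge messages at each target vertex.

    edges: list of (i, j, a_ij) triples, one per nonzero of a square matrix A
           (hypothetical layout chosen for this sketch).
    x:     list of vertex features, one scalar per vertex.
    """
    y = defaultdict(float)
    for i, j, a_ij in edges:
        # Message from vertex j to vertex i, aggregated by summation.
        y[i] += a_ij * x[j]
    return [y[i] for i in range(len(x))]


# 1D Laplacian stencil tridiag(-1, 2, -1) on three vertices.
edges = [
    (0, 0, 2.0), (0, 1, -1.0),
    (1, 0, -1.0), (1, 1, 2.0), (1, 2, -1.0),
    (2, 1, -1.0), (2, 2, 2.0),
]
x = [1.0, 2.0, 3.0]
y = matvec_message_passing(edges, x)
print(y)  # [0.0, 0.0, 4.0]
```

Because the sum runs over however many edges reach a vertex, the same routine works for any sparsity pattern, which is the property that lets a fixed-size MLP be applied afterward.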
LG · Dec 1, 2025
Domain-Decomposed Graph Neural Network Surrogate Modeling for Ice Sheets
Adrienne M. Propp, Mauro Perego, Eric C. Cyr et al.
Accurate yet efficient surrogate models are essential for large-scale simulations of partial differential equations (PDEs), particularly for uncertainty quantification (UQ) tasks that demand hundreds or thousands of evaluations. We develop a physics-inspired graph neural network (GNN) surrogate that operates directly on unstructured meshes and leverages the flexibility of graph attention. To improve both training efficiency and generalization properties of the model, we introduce a domain decomposition (DD) strategy that partitions the mesh into subdomains, trains local GNN surrogates in parallel, and aggregates their predictions. We then employ transfer learning to fine-tune models across subdomains, accelerating training and improving accuracy in data-limited settings. Applied to ice sheet simulations, our approach accurately predicts full-field velocities on high-resolution meshes, substantially reduces training time relative to training a single global surrogate model, and provides a ripe foundation for UQ objectives. Our results demonstrate that graph-based DD, combined with transfer learning, provides a scalable and reliable pathway for training GNN surrogates on massive PDE-governed systems, with broad potential for application beyond ice sheet dynamics.
CV · Mar 7, 2022
Parallel Training of GRU Networks with a Multi-Grid Solver for Long Sequences
Gordon Euhyun Moon, Eric C. Cyr
Parallelizing Gated Recurrent Unit (GRU) networks is a challenging task, as the training procedure of GRU is inherently sequential. Prior efforts to parallelize GRU have largely focused on conventional parallelization strategies such as data-parallel and model-parallel training algorithms. However, when the given sequences are very long, existing approaches are still inevitably performance limited in terms of training time. In this paper, we present a novel parallel training scheme (called parallel-in-time) for GRU based on a multigrid reduction in time (MGRIT) solver. MGRIT partitions a sequence into multiple shorter sub-sequences and trains the sub-sequences on different processors in parallel. The key to achieving speedup is a hierarchical correction of the hidden state to accelerate end-to-end communication in both the forward and backward propagation phases of gradient descent. Experimental results on the HMDB51 dataset, where each video is an image sequence, demonstrate that the new parallel training scheme achieves up to 6.5$\times$ speedup over a serial approach. As efficiency of our new parallelization strategy is associated with the sequence length, our parallel GRU algorithm achieves significant performance improvement as the sequence length increases.
37.7 · LG · May 23
Muon in Vision Transformers: Optimizer-Recipe Interactions and Gradient Spectra
Ben S. Southworth, Shuai Jiang, Daniel McBride et al.
Muon is a recently developed matrix-aware optimizer that has shown strong results in transformer training, but its behavior in vision transformers (ViTs) is not yet well understood. We study Muon for ViT training, largely on ImageNet-100 and Pl@ntNet-300K, comparing against AdamW under standard vision recipes involving mixup, cutmix, smoothing, and random augmentation and erasing. Muon consistently outperforms AdamW, with especially large gains on long-tailed Pl@ntNet macro top-1. These gains are also recipe-dependent, where Muon benefits much more than AdamW from advanced and significant data augmentation techniques. To understand this interaction, we analyze the singular-value structure of matrix gradients throughout the ViT. Within Muon training runs, removing heavy data augmentation induces a late-training spectral concentration and mode collapse in gradient matrices, primarily in deep MLP-down blocks. Under a fixed "full" augmentation recipe, the clearest Muon-AdamW contrast appears instead in QKV gradients, where AdamW gradient energy remains concentrated in a much narrower basis while Muon spreads energy across substantially more singular modes. Muon in ViTs is therefore best understood as an optimizer-recipe interaction. Under a fixed recipe, Muon differs from AdamW most clearly in attention projections, where its gradients consist of a broader spectral basis. Within Muon, a full training recipe is important for preventing late spectral concentration and mode collapse in deep feedforward blocks. We further demonstrate efficacy in training ViTs on image segmentation and masked autoencoder models, where Muon outperforms AdamW in all settings considered.
CV · Jul 31, 2024 · Code
DDU-Net: A Domain Decomposition-Based CNN for High-Resolution Image Segmentation on Multiple GPUs
Corné Verburg, Alexander Heinlein, Eric C. Cyr
The segmentation of ultra-high resolution images poses challenges such as loss of spatial information or computational inefficiency. In this work, a novel approach that combines encoder-decoder architectures with domain decomposition strategies to address these challenges is proposed. Specifically, a domain decomposition-based U-Net (DDU-Net) architecture is introduced, which partitions input images into non-overlapping patches that can be processed independently on separate devices. A communication network is added to facilitate inter-patch information exchange to enhance the understanding of spatial context. Experimental validation is performed on a synthetic dataset that is designed to measure the effectiveness of the communication network. Then, the performance is tested on the DeepGlobe land cover classification dataset as a real-world benchmark data set. The results demonstrate that the approach, which includes inter-patch communication for images divided into $16\times16$ non-overlapping subimages, achieves a $2-3\,\%$ higher intersection over union (IoU) score compared to the same network without inter-patch communication. The performance of the network which includes communication is equivalent to that of a baseline U-Net trained on the full image, showing that our model provides an effective solution for segmenting ultra-high-resolution images while preserving spatial context. The code is available at https://github.com/corne00/DDU-Net.
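As a small illustration of the decomposition step only, the sketch below splits an image into a grid of non-overlapping patches that could each be handed to a separate device. It is not taken from the DDU-Net repository; `split_into_patches` is a hypothetical helper, images are plain nested lists, and the inter-patch communication network described in the abstract is not modeled.

```python
def split_into_patches(image, rows, cols):
    """Split a 2D image (list of rows) into a rows x cols grid of patches.

    Returns patches[r][c], itself a 2D list. Patches do not overlap and
    together cover the whole image.
    """
    height, width = len(image), len(image[0])
    if height % rows or width % cols:
        raise ValueError("image size must be divisible by the patch grid")
    ph, pw = height // rows, width // cols
    return [
        [
            [row[c * pw:(c + 1) * pw] for row in image[r * ph:(r + 1) * ph]]
            for c in range(cols)
        ]
        for r in range(rows)
    ]


# 4 x 4 image with pixel value 4 * row + column, split into a 2 x 2 grid.
image = [[4 * r + c for c in range(4)] for r in range(4)]
patches = split_into_patches(image, 2, 2)
print(patches[0][1])  # [[2, 3], [6, 7]]
```

Processing such patches independently is what loses cross-patch context, which is the gap the communication network in the abstract is meant to close.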
55.8 · NA · Apr 29
Structure-Aware Tensorial Model Reduction
Arjun Vijaywargia, Eric C. Cyr, Anthony Gruber
This work investigates a two-stage method for constructing projection-based reduced-order models (ROMs) of parameterized partial differential equations (PDEs). Based on established tensorial ROM methodology, the proposed approach reduces dimensionality offline by encoding solution snapshots using a multi-linear Tucker factorization, so that a reduced basis which varies nonlinearly with PDE parameters can be rapidly constructed online and used in a Galerkin ROM. Two novel extensions of this strategy, tailored to the cases of structured PDEs and sparse parameter sampling, are presented: the construction of reduced bases orthonormalized with respect to a general discrete inner product, and the interpolation of encoded states via radial basis functions. Basic representation and ROM error estimates are presented demonstrating the validity of these modifications, and the approach is challenged on examples where monolithic-basis ROMs are known to struggle, including a realistic instance of Maxwell's equations in 3D. Results suggest that the proposed nonlinear basis ROM can effectively mitigate linear restrictions on Kolmogorov $n$-width while improving upon previous tensorial ROM technology, particularly in the highly nonlinear and data-limited regimes characteristic of practical use cases.
63.8 · LG · Apr 22
A Hybridizable Neural Time Integrator for Stable Autoregressive Forecasting
Brooks Kinch, Xiaozhe Hu, Yilong Huang et al.
For autoregressive modeling of chaotic dynamical systems over long time horizons, the stability of both training and inference is a major challenge in building scientific foundation models. We present a hybrid technique in which an autoregressive transformer is embedded within a novel shooting-based mixed finite element scheme, exposing topological structure that enables provable stability. For forward problems, we prove preservation of discrete energies, while for training we prove uniform bounds on gradients, provably avoiding the exploding gradient problem. Combined with a vision transformer, this yields latent tokens admitting structure-preserving dynamics. We outperform modern foundation models with a $65\times$ reduction in model parameters and long-horizon forecasting of chaotic systems. A "mini-foundation" model of a fusion component shows that 12 simulations suffice to train a real-time surrogate, achieving a $9{,}000\times$ speedup over particle-in-cell simulation.
LG · Mar 5
Multilevel Training for Kolmogorov Arnold Networks
Ben S. Southworth, Jonas A. Actor, Graham Harper et al.
Algorithmic speedup of training common neural architectures is made difficult by the lack of structure guaranteed by the function compositions inherent to such networks. In contrast to multilayer perceptrons (MLPs), Kolmogorov-Arnold networks (KANs) provide more structure by expanding learned activations in a specified basis. This paper exploits this structure to develop practical algorithms and theoretical insights, yielding training speedup via multilevel training for KANs. To do so, we first establish an equivalence between KANs with spline basis functions and multichannel MLPs with power ReLU activations through a linear change of basis. We then analyze how this change of basis affects the geometry of gradient-based optimization with respect to spline knots. The KAN change of basis motivates a multilevel training approach, where we train a sequence of KANs naturally defined through a uniform refinement of spline knots with analytic geometric interpolation operators between models. The interpolation scheme enables a "properly nested hierarchy" of architectures, ensuring that interpolation to a fine model preserves the progress made on coarse models, while the compact support of spline basis functions ensures complementary optimization on subsequent levels. Numerical experiments demonstrate that our multilevel training approach can achieve orders of magnitude improvement in accuracy over conventional methods to train comparable KANs or MLPs, particularly for physics-informed neural networks. Finally, this work demonstrates how principled design of neural networks can lead to exploitable structure, and in this case, multilevel algorithms that can dramatically improve training performance.
LG · Jan 13
Layer-Parallel Training for Transformers
Shuai Jiang, Marc Salvado, Eric C. Cyr et al.
We present a new training methodology for transformers using a multilevel, layer-parallel approach. Through a neural ODE formulation of transformers, our application of a multilevel parallel-in-time algorithm for the forward and backpropagation phases of training achieves parallel acceleration over the layer dimension. This dramatically enhances parallel scalability as the network depth increases, which is particularly useful for increasingly large foundational models. However, achieving this introduces errors that cause systematic bias in the gradients, which in turn reduces convergence when closer to the minima. We develop an algorithm to detect this critical transition and either switch to serial training or systematically increase the accuracy of layer-parallel training. Results, including BERT, GPT2, ViT, and machine translation architectures, demonstrate parallel acceleration as well as accuracy commensurate with serial pre-training while fine-tuning is unaffected.
LG · Sep 4, 2025
Deriving Transformer Architectures as Implicit Multinomial Regression
Jonas A. Actor, Anthony Gruber, Eric C. Cyr
While attention has been empirically shown to improve model performance, it lacks a rigorous mathematical justification. This short paper establishes a novel connection between attention mechanisms and multinomial regression. Specifically, we show that in a fixed multinomial regression setting, optimizing over latent features yields solutions that align with the dynamics induced on features by attention blocks. In other words, the evolution of representations through a transformer can be interpreted as a trajectory that recovers the optimal features for classification.
LG · May 23, 2025
Leveraging KANs for Expedient Training of Multichannel MLPs via Preconditioning and Geometric Refinement
Jonas A. Actor, Graham Harper, Ben Southworth et al.
Multilayer perceptrons (MLPs) are a workhorse machine learning architecture, used in a variety of modern deep learning frameworks. However, recently Kolmogorov-Arnold Networks (KANs) have become increasingly popular due to their success on a range of problems, particularly for scientific machine learning tasks. In this paper, we exploit the relationship between KANs and multichannel MLPs to gain structural insight into how to train MLPs faster. We demonstrate the KAN basis (1) provides geometric localized support, and (2) acts as a preconditioned descent in the ReLU basis, overall resulting in expedited training and improved accuracy. Our results show the equivalence between free-knot spline KAN architectures, and a class of MLPs that are refined geometrically along the channel dimension of each weight tensor. We exploit this structural equivalence to define a hierarchical refinement scheme that dramatically accelerates training of the multi-channel MLP architecture. We show further accuracy improvements can be had by allowing the $1$D locations of the spline knots to be trained simultaneously with the weights. These advances are demonstrated on a range of benchmark examples for regression and scientific machine learning.
LG · Jan 27, 2021
Partition of unity networks: deep hp-approximation
Kookjin Lee, Nathaniel A. Trask, Ravi G. Patel et al.
Approximation theorists have established best-in-class optimal approximation rates of deep neural networks by utilizing their ability to simultaneously emulate partitions of unity and monomials. Motivated by this, we propose partition of unity networks (POUnets) which incorporate these elements directly into the architecture. Classification architectures of the type used to learn probability measures are used to build a meshfree partition of space, while polynomial spaces with learnable coefficients are associated to each partition. The resulting hp-element-like approximation allows use of a fast least-squares optimizer, and the resulting architecture size need not scale exponentially with spatial dimension, breaking the curse of dimensionality. An abstract approximation result establishes desirable properties to guide network design. Numerical results for two choices of architecture demonstrate that POUnets yield hp-convergence for smooth functions and consistently outperform MLPs for piecewise polynomial functions with large numbers of discontinuities.
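The structure described here can be sketched in one dimension as $u(x) = \sum_i \phi_i(x)\, p_i(x)$, where the $\phi_i$ are nonnegative and sum to one and each $p_i$ is a polynomial. The sketch below assumes a softmax over hand-picked linear logits as the partition; the paper's actual classification architecture, the name `pou_eval`, and all coefficients are assumptions made for illustration.

```python
import math


def softmax(z):
    """Numerically stable softmax; outputs are positive and sum to one."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def pou_eval(x, logits_fn, coeffs):
    """Evaluate u(x) = sum_i phi_i(x) * p_i(x).

    logits_fn: maps x to one logit per partition (hypothetical stand-in
               for a classification network).
    coeffs:    coeffs[i][k] is the coefficient of x**k in polynomial p_i.
    """
    phi = softmax(logits_fn(x))
    polys = [sum(c * x**k for k, c in enumerate(cs)) for cs in coeffs]
    return sum(p * q for p, q in zip(phi, polys)), phi


# Two partitions: "left" carries p(x) = 0, "right" carries p(x) = x.
logits_fn = lambda x: [-10.0 * x, 10.0 * x]
coeffs = [[0.0], [0.0, 1.0]]

value, phi = pou_eval(2.0, logits_fn, coeffs)
print(sum(phi), value)
```

With the partition fixed, `u` is linear in the polynomial coefficients, which is consistent with the abstract's remark that a fast least-squares optimizer can be used for them.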
COMP-PH · Sep 25, 2020
A physics-informed operator regression framework for extracting data-driven continuum models
Ravi G. Patel, Nathaniel A. Trask, Mitchell A. Wood et al.
The application of deep learning toward discovery of data-driven models requires careful application of inductive biases to obtain a description of physics which is both accurate and robust. We present here a framework for discovering continuum models from high fidelity molecular simulation data. Our approach applies a neural network parameterization of governing physics in modal space, allowing a characterization of differential operators while providing structure which may be used to impose biases related to symmetry, isotropy, and conservation form. We demonstrate the effectiveness of our framework for a variety of physics, including local and nonlocal diffusion processes and single and multiphase flows. For the flow physics we demonstrate this approach leads to a learned operator that generalizes to system characteristics not included in the training sets, such as variable particle sizes, densities, and concentration.
LG · Jun 17, 2020
A block coordinate descent optimizer for classification problems exploiting convexity
Ravi G. Patel, Nathaniel A. Trask, Mamikon A. Gulian et al.
Second-order optimizers hold intriguing potential for deep learning, but suffer from increased cost and sensitivity to the non-convexity of the loss surface as compared to gradient-based approaches. We introduce a coordinate descent method to train deep neural networks for classification tasks that exploits global convexity of the cross-entropy loss in the weights of the linear layer. Our hybrid Newton/Gradient Descent (NGD) method is consistent with the interpretation of hidden layers as providing an adaptive basis and the linear layer as providing an optimal fit of the basis to data. By alternating between a second-order method to find globally optimal parameters for the linear layer and gradient descent to train the hidden layers, we ensure an optimal fit of the adaptive basis to data throughout training. The size of the Hessian in the second-order step scales only with the number of weights in the linear layer and not the depth and width of the hidden layers; furthermore, the approach is applicable to arbitrary hidden layer architecture. Previous work applying this adaptive basis perspective to regression problems demonstrated significant improvements in accuracy at reduced training cost, and this work can be viewed as an extension of this approach to classification problems. We first prove that the resulting Hessian matrix is symmetric semi-definite, and that the Newton step realizes a global minimizer. By studying classification of manufactured two-dimensional point cloud data, we demonstrate both an improvement in validation error and a striking qualitative difference in the basis functions encoded in the hidden layer when trained using NGD. Application to image classification benchmarks for both dense and convolutional architectures reveals improved training accuracy, suggesting possible gains of second-order methods over gradient descent.
LG · Dec 19, 2019
Multilevel Initialization for Layer-Parallel Deep Neural Network Training
Eric C. Cyr, Stefanie Günther, Jacob B. Schroder
This paper investigates multilevel initialization strategies for training very deep neural networks with a layer-parallel multigrid solver. The scheme is based on the continuous interpretation of the training problem as a problem of optimal control, in which neural networks are represented as discretizations of time-dependent ordinary differential equations. A key goal is to develop a method able to intelligently initialize the network parameters for the very deep networks enabled by scalable layer-parallel training. To do this, we apply a refinement strategy across the time domain, that is equivalent to refining in the layer dimension. The resulting refinements create deep networks, with good initializations for the network parameters coming from the coarser trained networks. We investigate the effectiveness of such multilevel "nested iteration" strategies for network training, showing supporting numerical evidence of reduced run time for equivalent accuracy. In addition, we study whether the initialization strategies provide a regularizing effect on the overall training process and reduce sensitivity to hyperparameters and randomness in initial network parameters.
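The refinement idea can be sketched with a scalar residual network read as forward Euler for $\dot{x} = \tanh(w(t)\,x + b(t))$ on $[0, T]$: doubling the number of layers halves the step size, and the fine network is initialized from the coarse one. The sketch assumes piecewise-constant interpolation of the parameters in time (each coarse layer copied to its two children); the interpolation used in the paper may differ, and `forward` and `refine` are hypothetical names.

```python
import math


def forward(x, params, T=1.0):
    """Residual network as forward Euler with step h = T / number of layers."""
    h = T / len(params)
    for w, b in params:
        x = x + h * math.tanh(w * x + b)
    return x


def refine(params):
    """Double the depth by copying each layer's parameters to two fine layers.

    Assumed piecewise-constant interpolation in the layer (time) dimension.
    """
    return [p for p in params for _ in range(2)]


coarse = [(0.5, 0.1), (-0.3, 0.2)]  # (weight, bias) per layer, made up
fine = refine(coarse)

# The refined network has twice the layers but approximates the same ODE
# flow, so its output stays close to the coarse network's output.
print(len(fine), forward(1.0, coarse), forward(1.0, fine))
```

Since both networks discretize the same underlying ODE, the fine network starts near the coarse solution instead of from a random initialization, which is the motivation for nested iteration given in the abstract.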
LG · Dec 10, 2019
Robust Training and Initialization of Deep Neural Networks: An Adaptive Basis Viewpoint
Eric C. Cyr, Mamikon A. Gulian, Ravi G. Patel et al.
Motivated by the gap between theoretical optimal approximation rates of deep neural networks (DNNs) and the accuracy realized in practice, we seek to improve the training of DNNs. The adoption of an adaptive basis viewpoint of DNNs leads to novel initializations and a hybrid least squares/gradient descent optimizer. We provide analysis of these techniques and illustrate via numerical examples dramatic increases in accuracy and convergence rate for benchmarks characterizing scientific applications where DNNs are currently used, including regression problems and physics-informed neural networks for the solution of partial differential equations.