Ben S. Southworth

LG
h-index: 20
11 papers
9 citations
Novelty: 53%
AI Score: 50

11 Papers

NA Mar 1, 2019
Nonsymmetric Algebraic Multigrid Based on Local Approximate Ideal Restriction (lAIR)

Ben S. Southworth, Thomas A. Manteuffel, John Ruge

Algebraic multigrid (AMG) solvers and preconditioners are some of the fastest numerical methods to solve linear systems, particularly in a parallel environment, scaling to hundreds of thousands of cores. Most AMG methods and theory assume a symmetric positive definite operator. This paper presents a new variation on classical AMG for nonsymmetric matrices (denoted lAIR), based on a local approximation to the ideal restriction operator, coupled with F-relaxation. A new block decomposition of the AMG error-propagation operator is used for a spectral analysis of convergence, and the efficacy of the algorithm is demonstrated on systems arising from the discrete form of the advection-diffusion-reaction equation. lAIR is shown to be a robust solver for various discretizations of the advection-diffusion-reaction equation, including time-dependent and steady-state, from purely advective to purely diffusive. Convergence is robust for discretizations on unstructured meshes and using higher-order finite elements, and is particularly effective on upwind discontinuous Galerkin discretizations. Although the implementation used here is not parallel, each part of the algorithm is highly parallelizable, avoiding common multigrid adjustments for strong advection such as line-relaxation and K- or W-cycles that can be effective in serial, but suffer from high communication costs in parallel, limiting their scalability.
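The "ideal restriction" that lAIR approximates locally can be illustrated on a small dense matrix. The following is a minimal sketch, assuming the usual F/C block splitting of A and the standard form R = [-A_cf A_ff^{-1}, I]; the matrix, sizes, and variable names are hypothetical, and this is not the paper's lAIR algorithm (which builds a sparse local approximation rather than inverting A_ff).

```python
import numpy as np

rng = np.random.default_rng(0)
n, nf = 8, 5  # hypothetical sizes: 8 unknowns, the first 5 treated as F-points

# Nonsymmetric test matrix, shifted by n*I so that A_ff is well conditioned.
A = rng.standard_normal((n, n)) + n * np.eye(n)
A_ff, A_fc = A[:nf, :nf], A[:nf, nf:]
A_cf, A_cc = A[nf:, :nf], A[nf:, nf:]

# Ideal restriction R = [-A_cf A_ff^{-1}, I] (dense inverse for illustration only).
R_ideal = np.hstack([-A_cf @ np.linalg.inv(A_ff), np.eye(n - nf)])

# R_ideal @ A has a zero F-block; its C-block is the Schur complement.
RA = R_ideal @ A
schur = A_cc - A_cf @ np.linalg.solve(A_ff, A_fc)
f_block_norm = np.linalg.norm(RA[:, :nf])
c_block_error = np.linalg.norm(RA[:, nf:] - schur)
print(f_block_norm, c_block_error)
```

Both printed quantities should be at the level of rounding error, which is the property a local approximation to ideal restriction tries to retain at much lower cost.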

NA Jan 29, 2018
A Root-Node Based Algebraic Multigrid Method

Thomas A. Manteuffel, Luke N. Olson, Jacob B. Schroder et al.

This paper provides a unified and detailed presentation of root-node style algebraic multigrid (AMG). Algebraic multigrid is a popular and effective iterative method for solving large, sparse linear systems that arise from discretizing partial differential equations. However, while AMG is designed for symmetric positive definite (SPD) matrices, certain SPD problems, such as anisotropic diffusion, are still not adequately addressed by existing methods. Non-SPD problems pose an even greater challenge, and in practice AMG is often not considered as a solver for such problems. The focus of this paper is on so-called root-node AMG, which can be viewed as a combination of classical and aggregation-based multigrid. An algorithm for root-node is outlined and a filtering strategy is developed, which is able to control the cost of using root-node AMG, particularly on difficult problems. New theoretical motivation is provided for root-node and energy-minimization as applied to symmetric as well as non-symmetric systems. Numerical results are then presented demonstrating the robust ability of root-node to solve non-symmetric problems, systems-based problems, and difficult SPD problems, including strongly anisotropic diffusion, convection-diffusion, and upwind steady-state transport, in a scalable manner. New, detailed estimates of the computational cost of the setup and solve phase are given for each example, providing additional support for root-node AMG over alternative methods.

NA Jun 4, 2019
Multilevel convergence analysis of multigrid-reduction-in-time

Andreas Hessenthaler, Ben S. Southworth, David Nordsletten et al.

This paper presents a multilevel convergence framework for multigrid-reduction-in-time (MGRIT) as a generalization of previous two-grid estimates. The framework provides a priori upper bounds on the convergence of MGRIT V- and F-cycles, with different relaxation schemes, by deriving the respective residual and error propagation operators. The residual and error operators are functions of the time stepping operator, analyzed directly and bounded in norm, both numerically and analytically. We present various upper bounds of different computational cost and varying sharpness. These upper bounds are complemented by proposing analytic formulae for the approximate convergence factor of V-cycle algorithms that take the number of fine grid time points, the temporal coarsening factors, and the eigenvalues of the time stepping operator as parameters. The paper concludes with supporting numerical investigations of parabolic (anisotropic diffusion) and hyperbolic (wave equation) model problems. We assess the sharpness of the bounds and the quality of the approximate convergence factors. Observations from these numerical investigations demonstrate the value of the proposed multilevel convergence framework for estimating MGRIT convergence a priori and for the design of a convergent algorithm. We further highlight that observations in the literature are captured by the theory, including that two-level Parareal and multilevel MGRIT with F-relaxation do not yield scalable algorithms, and that a stronger relaxation scheme is beneficial. An important observation is that with increasing numbers of levels MGRIT convergence deteriorates for the hyperbolic model problem, while constant convergence factors can be achieved for the diffusion equation. The theory also indicates that L-stable Runge-Kutta schemes are more amenable to multilevel parallel-in-time integration with MGRIT than A-stable Runge-Kutta schemes.

NA Feb 13, 2019
The Role of Energy Minimization in Algebraic Multigrid Interpolation

James Brannick, Scott P. MacLachlan, Jacob B. Schroder et al.

Algebraic multigrid (AMG) methods are powerful solvers with linear or near-linear computational complexity for certain classes of linear systems, Ax=b. Broadening the scope of problems that AMG can effectively solve requires the development of improved interpolation operators. Such development is often based on AMG convergence theory. However, convergence theory in AMG tends to have a disconnect with AMG in practice due to the practical constraints of (i) maintaining matrix sparsity in transfer and coarse-grid operators, and (ii) retaining linear complexity in the setup and solve phase. This paper presents a review of fundamental results in AMG convergence theory, followed by a discussion on how these results can be used to motivate interpolation operators in practice. A general weighted energy minimization functional is then proposed to form interpolation operators, and a novel "diagonal" preconditioner for Sylvester- or Lyapunov-type equations is developed simultaneously. Although results based on the weighted energy minimization typically underperform compared to a fully constrained energy minimization, numerical results provide new insight into the role of energy minimization and constraint vectors in AMG interpolation.

LG May 23
Muon in Vision Transformers: Optimizer-Recipe Interactions and Gradient Spectra

Ben S. Southworth, Shuai Jiang, Daniel McBride et al.

Muon is a recently developed matrix-aware optimizer that has shown strong results in transformer training, but its behavior in vision transformers (ViTs) is not yet well understood. We study Muon for ViT training, largely on ImageNet-100 and Pl@ntNet-300K, comparing against AdamW under standard vision recipes involving mixup, cutmix, smoothing, and random augmentation and erasing. Muon consistently outperforms AdamW, with especially large gains on long-tailed Pl@ntNet macro top-1. These gains are also recipe-dependent, where Muon benefits much more than AdamW from advanced and significant data augmentation techniques. To understand this interaction, we analyze the singular-value structure of matrix gradients throughout the ViT. Within Muon training runs, removing heavy data augmentation induces a late-training spectral concentration and mode collapse in gradient matrices, primarily in deep MLP-down blocks. Under a fixed "full" augmentation recipe, the clearest Muon-AdamW contrast appears instead in QKV gradients, where AdamW gradient energy remains concentrated in a much narrower basis while Muon spreads energy across substantially more singular modes. Muon in ViTs is therefore best understood as an optimizer-recipe interaction. Under a fixed recipe, Muon differs from AdamW most clearly in attention projections, where its gradients consist of a broader spectral basis. Within Muon, a full training recipe is important for preventing late spectral concentration and mode collapse in deep feedforward blocks. We further demonstrate efficacy in training ViTs on image segmentation and masked autoencoder models, where Muon outperforms AdamW in all settings considered.
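One simple way to quantify how "concentrated" or "spread" a gradient matrix is across singular modes is to count how many modes are needed to capture a fixed fraction of the squared singular-value energy. The sketch below is only an illustration of that idea; the function `modes_for_energy`, the 90% threshold, and the synthetic matrices are assumptions and are not taken from the paper's analysis.

```python
import numpy as np

def modes_for_energy(G, fraction=0.9):
    """Smallest number of singular modes holding `fraction` of the squared energy."""
    s = np.linalg.svd(G, compute_uv=False)      # singular values, descending
    energy = s**2 / np.sum(s**2)                # normalized energy per mode
    return int(np.searchsorted(np.cumsum(energy), fraction) + 1)

rng = np.random.default_rng(0)
# Synthetic stand-ins for gradient matrices (hypothetical data):
concentrated = np.outer(rng.standard_normal(8), rng.standard_normal(8))  # rank one
spread = np.eye(8)                                                       # flat spectrum

n_concentrated = modes_for_energy(concentrated)
n_spread = modes_for_energy(spread)
print(n_concentrated, n_spread)  # prints: 1 8
```

A rank-one matrix needs a single mode, while a matrix with a flat spectrum needs all eight, which is the kind of contrast the abstract describes between narrow and broad spectral bases.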

LG Mar 18
Beyond Muon: MUD (MomentUm Decorrelation) for Faster Transformer Training

Ben S. Southworth, Stephen Thomas

Orthogonalized-momentum optimizers such as Muon improve transformer training by approximately whitening/orthogonalizing matrix-valued momentum updates via a short polar-decomposition iteration. However, polar-factor approximations typically require multiple large matrix multiplications, and the resulting overhead can be substantial and hardware-dependent. We introduce MUD (MomentUm Decorrelation), a complementary whitening approach that replaces Muon's polar update with a triangular (Cholesky-like) whitening surrogate inspired by classical Gram-Schmidt and Gauss-Seidel ideas. We show that row-orthonormal matrices are fixed points of the MUD map, relate the inner step to symmetric Gauss-Seidel preconditioning of the Gram matrix, and prove quadratic local convergence near the fixed point. MUD yields consistent 10-50% wall-clock improvements in time-to-perplexity over tuned AdamW and Muon, typically converging slightly slower per step than Muon but with substantially lower optimizer overhead: relative to Muon, MUD improves peak tokens/s by roughly $1.3$-$2.6\times$ across most settings and up to nearly $3\times$ on GPT-2 large on an A100. We also demonstrate training an ESM-2 150M protein language model, where MUD matches Muon-level validation perplexity in significantly less wall-clock time.
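To make the cost of a polar-factor approximation concrete, the sketch below uses the classical cubic Newton-Schulz iteration, where each step needs two matrix multiplications. This is a stand-in chosen for illustration: it is neither Muon's tuned iteration nor the MUD map, and the function name, step count, and test matrix are assumptions.

```python
import numpy as np

def newton_schulz_polar(M, steps=30):
    """Approximate the polar factor of a wide, full-row-rank matrix M.

    Classical cubic Newton-Schulz iteration X <- 1.5 X - 0.5 (X X^T) X.
    Dividing by the Frobenius norm puts all singular values in (0, 1],
    inside the iteration's convergence region.
    """
    X = M / np.linalg.norm(M)
    for _ in range(steps):
        X = 1.5 * X - 0.5 * (X @ X.T) @ X   # two matrix multiplications per step
    return X

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 6))             # hypothetical momentum matrix
Q = newton_schulz_polar(M)

# Reference polar factor U V^T from the SVD M = U S V^T.
U, _, Vt = np.linalg.svd(M, full_matrices=False)
polar_error = np.linalg.norm(Q - U @ Vt)
row_orth_error = np.linalg.norm(Q @ Q.T - np.eye(4))
print(polar_error, row_orth_error)
```

The 30 steps are used here only so the sketch converges to tight tolerance; the abstract describes the iteration used in practice as short. The result is row-orthonormal, which is the same class of matrices the abstract identifies as fixed points of the MUD map.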

COMP-PH Dec 28, 2024
Physics consistent machine learning framework for inverse modeling with applications to ICF capsule implosions

Daniel A. Serino, Evan Bell, Marc Klasky et al.

In high energy density physics (HEDP) and inertial confinement fusion (ICF), predictive modeling is complicated by uncertainty in parameters that characterize various aspects of the modeled system, such as those characterizing material properties, equation of state (EOS), opacities, and initial conditions. Typically, however, these parameters are not directly observable. What is observed instead is a time sequence of radiographic projections using X-rays. In this work, we define a set of sparse hydrodynamic features derived from the outgoing shock profile and outer material edge, which can be obtained from radiographic measurements, to directly infer such parameters. Our machine learning (ML)-based methodology involves a pipeline of two architectures, a radiograph-to-features network (R2FNet) and a features-to-parameters network (F2PNet), that are trained independently and later combined to approximate a posterior distribution for the parameters from radiographs. We show that the estimated parameters can be used in a hydrodynamics code to obtain density fields and hydrodynamic shock and outer edge features that are consistent with the data. Finally, we demonstrate that features resulting from an unknown EOS model can be successfully mapped onto parameters of a chosen analytical EOS model, implying that network predictions are learning physics, with a degree of invariance to the underlying choice of EOS model.

COMP-PH Sep 5, 2025
Causal Multi-fidelity Surrogate Forward and Inverse Models for ICF Implosions

Tyler E. Maltba, Ben S. Southworth, Jeffrey R. Haack et al.

Continued progress in inertial confinement fusion (ICF) requires solving inverse problems relating experimental observations to simulation input parameters, followed by design optimization. However, such high dimensional dynamic PDE-constrained optimization problems are extremely challenging or even intractable. It has been recently shown that inverse problems can be solved by only considering certain robust features. Here we consider the ICF capsule's deuterium-tritium (DT) interface, and construct a causal, dynamic, multifidelity reduced-order surrogate that maps from a time-dependent radiation temperature drive to the interface's radius and velocity dynamics. The surrogate targets an ODE embedding of DT interface dynamics, and is constructed by learning a controller for a base analytical model using low- and high-fidelity simulation training data with respect to radiation energy group structure. After demonstrating excellent accuracy of the surrogate interface model, we use machine learning (ML) models with surrogate-generated data to solve inverse problems optimizing radiation temperature drive to reproduce observed interface dynamics. For sparse snapshots in time, the ML model further characterizes the most informative times at which to sample dynamics. Altogether we demonstrate how operator learning, causal architectures, and physical inductive bias can be integrated to accelerate discovery, design, and diagnostics in high-energy-density systems.

COMP-PH Jun 30, 2025
Learning robust parameter inference and density reconstruction in flyer plate impact experiments

Evan Bell, Daniel A. Serino, Ben S. Southworth et al.

Estimating physical parameters or material properties from experimental observations is a common objective in many areas of physics and material science. In many experiments, especially in shock physics, radiography is the primary means of observing the system of interest. However, radiography does not provide direct access to key state variables, such as density, which prevents the application of traditional parameter estimation approaches. Here we focus on flyer plate impact experiments on porous materials, and resolving the underlying parameterized equation of state (EoS) and crush porosity model parameters given radiographic observation(s). We use machine learning as a tool to demonstrate with high confidence that using only high impact velocity data does not provide sufficient information to accurately infer both EoS and crush model parameters, even with fully resolved density fields or a dynamic sequence of images. We thus propose an observable data set consisting of low and high impact velocity experiments/simulations that capture different regimes of compaction and shock propagation, and proceed to introduce a generative machine learning approach which produces a posterior distribution of physical parameters directly from radiographs. We demonstrate the effectiveness of the approach in estimating parameters from simulated flyer plate impact experiments, and show that the obtained estimates of EoS and crush model parameters can then be used in hydrodynamic simulations to obtain accurate and physically admissible density reconstructions. Finally, we examine the robustness of the approach to model mismatches, and find that the learned approach can provide useful parameter estimates in the presence of out-of-distribution radiographic noise and previously unseen physics, thereby promoting a potential breakthrough in estimating material properties from experimental radiographic images.

LG Mar 5
Multilevel Training for Kolmogorov Arnold Networks

Ben S. Southworth, Jonas A. Actor, Graham Harper et al.

Algorithmic speedup of training common neural architectures is made difficult by the lack of structure guaranteed by the function compositions inherent to such networks. In contrast to multilayer perceptrons (MLPs), Kolmogorov-Arnold networks (KANs) provide more structure by expanding learned activations in a specified basis. This paper exploits this structure to develop practical algorithms and theoretical insights, yielding training speedup via multilevel training for KANs. To do so, we first establish an equivalence between KANs with spline basis functions and multichannel MLPs with power ReLU activations through a linear change of basis. We then analyze how this change of basis affects the geometry of gradient-based optimization with respect to spline knots. The KAN change of basis motivates a multilevel training approach, where we train a sequence of KANs naturally defined through a uniform refinement of spline knots with analytic geometric interpolation operators between models. The interpolation scheme enables a "properly nested hierarchy" of architectures, ensuring that interpolation to a fine model preserves the progress made on coarse models, while the compact support of spline basis functions ensures complementary optimization on subsequent levels. Numerical experiments demonstrate that our multilevel training approach can achieve orders of magnitude improvement in accuracy over conventional methods to train comparable KANs or MLPs, particularly for physics-informed neural networks. Finally, this work demonstrates how principled design of neural networks can lead to exploitable structure, and in this case, multilevel algorithms that can dramatically improve training performance.
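Two of the ideas in this abstract can be checked in the simplest setting of degree-one (piecewise-linear) splines. The sketch below, which assumes uniform knots and uses hypothetical helper names and coefficients, shows (i) that a hat basis function is a linear combination of three shifted ReLUs, and (ii) that interpolating coefficients to a uniformly refined knot set reproduces the coarse function exactly. It is not the paper's construction, which covers general spline orders and full KAN layers.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def hat(x, center, h):
    """Linear B-spline with peak 1 at `center` and support [center - h, center + h]."""
    return relu(1.0 - np.abs(x - center) / h)

def hat_from_relu(x, center, h):
    """The same hat function written as a combination of three shifted ReLUs."""
    return (relu(x - (center - h)) - 2.0 * relu(x - center)
            + relu(x - (center + h))) / h

x = np.linspace(-1.0, 2.0, 601)
basis_gap = np.max(np.abs(hat(x, 0.5, 0.25) - hat_from_relu(x, 0.5, 0.25)))

# Nested refinement: for linear splines the coefficients are the knot values,
# so evaluating the coarse spline at the fine knots gives the fine coefficients.
coarse_knots = np.linspace(0.0, 1.0, 5)
coarse_coef = np.array([0.0, 1.0, -0.5, 2.0, 0.3])   # hypothetical trained values
fine_knots = np.linspace(0.0, 1.0, 9)                # uniform refinement
fine_coef = np.interp(fine_knots, coarse_knots, coarse_coef)

xs = np.linspace(0.0, 1.0, 401)
refine_gap = np.max(np.abs(np.interp(xs, coarse_knots, coarse_coef)
                           - np.interp(xs, fine_knots, fine_coef)))
print(basis_gap, refine_gap)
```

Both gaps should be at rounding level: the fine model starts from exactly the function learned on the coarse knots, which is the sense in which progress is preserved across levels.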

NA May 13, 2019
Necessary Conditions and Tight Two-level Convergence Bounds for Parareal and Multigrid Reduction in Time

Ben S. Southworth

Parareal and multigrid reduction in time (MGRiT) are two of the most popular parallel-in-time methods. The idea is to treat time integration in a parallel context by using a multigrid method in time. If $Φ$ is a (fine-grid) time-stepping scheme, let $Ψ$ denote a "coarse-grid" time-stepping scheme chosen to approximate $k$ steps of $Φ$, $k \geq 1$. In particular, $Ψ$ defines the coarse-grid correction, and evaluating $Ψ$ should be (significantly) cheaper than evaluating $Φ^k$. A number of papers have studied the convergence of Parareal and MGRiT. However, there have yet to be general conditions developed on the convergence of Parareal or MGRiT that answer simple questions such as, (i) for a given $Φ$ and $k$, what is the best $Ψ$, or (ii) can Parareal/MGRiT converge for my problem? This work derives necessary and sufficient conditions for the convergence of Parareal and MGRiT applied to linear problems, along with tight two-level convergence bounds. Results rest on the introduction of a "temporal approximation property" (TAP) that indicates how $Φ^k$ must approximate the action of $Ψ$ on different vectors. Loosely, for unitarily diagonalizable operators, the TAP indicates that fine-grid and coarse-grid time integration schemes must integrate geometrically smooth spatial components similarly, and less so for geometrically high frequency. In the (non-unitarily) diagonalizable setting, the conditioning of each eigenvector, $\mathbf{v}_i$, must also be reflected in how well $Ψ\mathbf{v}_i \sim Φ^k\mathbf{v}_i$. In general, worst-case convergence bounds are exactly given by $\min φ < 1$ such that an inequality along the lines of $\|(Ψ - Φ^k)\mathbf{v}\| \leq φ\|(I - Ψ)\mathbf{v}\|$ holds for all $\mathbf{v}$. Such inequalities are formalized as different realizations of the TAP, and form the basis for convergence of MGRiT and Parareal.
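A scalar example gives a feel for the inequality above. The sketch below assumes the test problem u' = -u with backward Euler for both Φ and Ψ, so the operators reduce to numbers and the inequality reduces to a ratio |Ψ - Φ^k| / |1 - Ψ|. All parameter values and variable names are hypothetical. In this scalar case a simple Toeplitz norm estimate shows that the ratio bounds the per-iteration contraction of the Parareal error at the coarse time points; this is only an illustration and does not reproduce the paper's general result.

```python
import numpy as np

dt, k, n_coarse = 0.05, 4, 20      # hypothetical fine step, coarsening factor, intervals
lam = 1.0 / (1.0 + dt)             # Φ: one fine backward Euler step for u' = -u
mu = 1.0 / (1.0 + k * dt)          # Ψ: one coarse backward Euler step
phi_k = lam ** k                   # Φ^k: k fine steps

# Scalar version of the inequality in the abstract.
tap_ratio = abs(mu - phi_k) / abs(1.0 - mu)

# Sequential fine solution at the coarse time points, with u_0 = 1.
exact = phi_k ** np.arange(n_coarse + 1)

# Parareal: u_{n+1}^{new} = Ψ u_n^{new} + (Φ^k - Ψ) u_n^{old}.
u = np.zeros(n_coarse + 1)
u[0] = 1.0
errors = [np.linalg.norm(exact - u)]
for _ in range(10):
    u_new = u.copy()
    for n in range(n_coarse):
        u_new[n + 1] = mu * u_new[n] + (phi_k - mu) * u[n]
    u = u_new
    errors.append(np.linalg.norm(exact - u))

print(tap_ratio, errors[-1])
```

For these parameters the ratio is well below one, and the iterates converge to the sequential fine solution within a few iterations.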