Jacob B. Schroder

8papers

138citations

Novelty46%

AI Score38

Ranked #111,165 of 205,806 authors (top 54%)#949 in NA (top 165%)

8 Papers

DCDec 15, 2015

Reducing Parallel Communication in Algebraic Multigrid through Sparsification

Amanda Bienz, Robert D. Falgout William Gropp, Luke N. Olson et al.

Algebraic multigrid (AMG) is an $\mathcal{O}(n)$ solution process for many large sparse linear systems. A hierarchy of progressively coarser grids is constructed that utilize complementary relaxation and interpolation operators. High-energy error is reduced by relaxation, while low-energy error is mapped to coarse-grids and reduced there. However, large parallel communication costs often limit parallel scalability. As the multigrid hierarchy is formed, each coarse matrix is formed through a triple matrix product. The resulting coarse-grids often have significantly more nonzeros per row than the original fine-grid operator, thereby generating high parallel communication costs on coarse-levels. In this paper, we introduce a method that systematically removes entries in coarse-grid matrices after the hierarchy is formed, leading to an improved communication costs. We sparsify by removing weakly connected or unimportant entries in the matrix, leading to improved solve time. The main trade-off is that if the heuristic identifying unimportant entries is used too aggressively, then AMG convergence can suffer. To counteract this, the original hierarchy is retained, allowing entries to be reintroduced into the solver hierarchy if convergence is too slow. This enables a balance between communication cost and convergence, as necessary. In this paper we present new algorithms for reducing communication and present a number of computational experiments in support.

NAJan 29, 2018

A Root-Node Based Algebraic Multigrid Method

Thomas A. Manteuffel, Luke N. Olson, Jacob B. Schroder et al.

This paper provides a unified and detailed presentation of root-node style algebraic multigrid (AMG). Algebraic multigrid is a popular and effective iterative method for solving large, sparse linear systems that arise from discretizing partial differential equations. However, while AMG is designed for symmetric positive definite matrices (SPD), certain SPD problems, such as anisotropic diffusion, are still not adequately addressed by existing methods. Non-SPD problems pose an even greater challenge, and in practice AMG is often not considered as a solver for such problems. The focus of this paper is on so-called root-node AMG, which can be viewed as a combination of classical and aggregation-based multigrid. An algorithm for root-node is outlined and a filtering strategy is developed, which is able to control the cost of using root-node AMG, particularly on difficult problems. New theoretical motivation is provided for root-node and energy-minimization as applied to symmetric as well non-symmetric systems. Numerical results are then presented demonstrating the robust ability of root-node to solve non-symmetric problems, systems-based problems, and difficult SPD problems, including strongly anisotropic diffusion, convection-diffusion, and upwind steady-state transport, in a scalable manner. New, detailed estimates of the computational cost of the setup and solve phase are given for each example, providing additional support for root-node AMG over alternative methods.

NAJun 4, 2019

Multilevel convergence analysis of multigrid-reduction-in-time

Andreas Hessenthaler, Ben S. Southworth, David Nordsletten et al.

This paper presents a multilevel convergence framework for multigrid-reduction-in-time (MGRIT) as a generalization of previous two-grid estimates. The framework provides a priori upper bounds on the convergence of MGRIT V- and F-cycles, with different relaxation schemes, by deriving the respective residual and error propagation operators. The residual and error operators are functions of the time stepping operator, analyzed directly and bounded in norm, both numerically and analytically. We present various upper bounds of different computational cost and varying sharpness. These upper bounds are complemented by proposing analytic formulae for the approximate convergence factor of V-cycle algorithms that take the number of fine grid time points, the temporal coarsening factors, and the eigenvalues of the time stepping operator as parameters. The paper concludes with supporting numerical investigations of parabolic (anisotropic diffusion) and hyperbolic (wave equation) model problems. We assess the sharpness of the bounds and the quality of the approximate convergence factors. Observations from these numerical investigations demonstrate the value of the proposed multilevel convergence framework for estimating MGRIT convergence a priori and for the design of a convergent algorithm. We further highlight that observations in the literature are captured by the theory, including that two-level Parareal and multilevel MGRIT with F-relaxation do not yield scalable algorithms and the benefit of a stronger relaxation scheme. An important observation is that with increasing numbers of levels MGRIT convergence deteriorates for the hyperbolic model problem, while constant convergence factors can be achieved for the diffusion equation. The theory also indicates that L-stable Runge-Kutta schemes are more amendable to multilevel parallel-in-time integration with MGRIT than A-stable Runge-Kutta schemes.

NAFeb 13, 2019

The Role of Energy Minimization in Algebraic Multigrid Interpolation

James Brannick, Scott P. MacLachlan, Jacob B. Schroder et al.

Algebraic multigrid (AMG) methods are powerful solvers with linear or near-linear computational complexity for certain classes of linear systems, Ax=b. Broadening the scope of problems that AMG can effectively solve requires the development of improved interpolation operators. Such development is often based on AMG convergence theory. However, convergence theory in AMG tends to have a disconnect with AMG in practice due to the practical constraints of (i) maintaining matrix sparsity in transfer and coarse-grid operators, and (ii) retaining linear complexity in the setup and solve phase. This paper presents a review of fundamental results in AMG convergence theory, followed by a discussion on how these results can be used to motivate interpolation operators in practice. A general weighted energy minimization functional is then proposed to form interpolation operators, and a novel `diagonal' preconditioner for Sylvester- or Lyapunov-type equations developed simultaneously. Although results based on the weighted energy minimization typically underperform compared to a fully constrained energy minimization, numerical results provide new insight into the role of energy minimization and constraint vectors in AMG interpolation.

NAAug 5, 2022

Parallel Energy-Minimization Prolongation for Algebraic Multigrid

Carlo Janna, Andrea Franceschini, Jacob B. Schroder et al.

Algebraic multigrid (AMG) is one of the most widely used solution techniques for linear systems of equations arising from discretized partial differential equations. The popularity of AMG stems from its potential to solve linear systems in almost linear time, that is with an O(n) complexity, where n is the problem size. This capability is crucial at the present, where the increasing availability of massive HPC platforms pushes for the solution of very large problems. The key for a rapidly converging AMG method is a good interplay between the smoother and the coarse-grid correction, which in turn requires the use of an effective prolongation. From a theoretical viewpoint, the prolongation must accurately represent near kernel components and, at the same time, be bounded in the energy norm. For challenging problems, however, ensuring both these requirements is not easy and is exactly the goal of this work. We propose a constrained minimization procedure aimed at reducing prolongation energy while preserving the near kernel components in the span of interpolation. The proposed algorithm is based on previous energy minimization approaches utilizing a preconditioned restricted conjugate gradients method, but has new features and a specific focus on parallel performance and implementation. It is shown that the resulting solver, when used for large real-world problems from various application fields, exhibits excellent convergence rates and scalability and outperforms at least some more traditional AMG approaches.

LGJan 13

Layer-Parallel Training for Transformers

Shuai Jiang, Marc Salvado, Eric C. Cyr et al.

We present a new training methodology for transformers using a multilevel, layer-parallel approach. Through a neural ODE formulation of transformers, our application of a multilevel parallel-in-time algorithm for the forward and backpropagation phases of training achieves parallel acceleration over the layer dimension. This dramatically enhances parallel scalability as the network depth increases, which is particularly useful for increasingly large foundational models. However, achieving this introduces errors that cause systematic bias in the gradients, which in turn reduces convergence when closer to the minima. We develop an algorithm to detect this critical transition and either switch to serial training or systematically increase the accuracy of layer-parallel training. Results, including BERT, GPT2, ViT, and machine translation architectures, demonstrate parallel-acceleration as well as accuracy commensurate with serial pre-training while fine-tuning is unaffected.

LGDec 19, 2019

Multilevel Initialization for Layer-Parallel Deep Neural Network Training

Eric C. Cyr, Stefanie Günther, Jacob B. Schroder

This paper investigates multilevel initialization strategies for training very deep neural networks with a layer-parallel multigrid solver. The scheme is based on the continuous interpretation of the training problem as a problem of optimal control, in which neural networks are represented as discretizations of time-dependent ordinary differential equations. A key goal is to develop a method able to intelligently initialize the network parameters for the very deep networks enabled by scalable layer-parallel training. To do this, we apply a refinement strategy across the time domain, that is equivalent to refining in the layer dimension. The resulting refinements create deep networks, with good initializations for the network parameters coming from the coarser trained networks. We investigate the effectiveness of such multilevel "nested iteration" strategies for network training, showing supporting numerical evidence of reduced run time for equivalent accuracy. In addition, we study whether the initialization strategies provide a regularizing effect on the overall training process and reduce sensitivity to hyperparameters and randomness in initial network parameters.

NAAug 7, 2017

Parallelizing Over Artificial Neural Network Training Runs with Multigrid

Jacob B. Schroder

Artificial neural networks are a popular and effective machine learning technique. Great progress has been made parallelizing the expensive training phase of an individual network, leading to highly specialized pieces of hardware, many based on GPU-type architectures, and more concurrent algorithms such as synthetic gradients. However, the training phase continues to be a bottleneck, where the training data must be processed serially over thousands of individual training runs. This work considers a multigrid reduction in time (MGRIT) algorithm that is able to parallelize over the thousands of training runs and converge to the exact same solution as traditional training would provide. MGRIT was originally developed to provide parallelism for time evolution problems that serially step through a finite number of time-steps. This work recasts the training of a neural network similarly, treating neural network training as an evolution equation that evolves the network weights from one step to the next. Thus, this work concerns distributed computing approaches for neural networks, but is distinct from other approaches which seek to parallelize only over individual training runs. The work concludes with supporting numerical results for two model problems.