NAFeb 13, 2017
Acceleration of the Implicit-Explicit Non-hydrostatic Unified Model of the Atmosphere (NUMA) on Manycore ProcessorsDaniel S. Abdi, Francis X. Giraldo, Emil M. Constantinescu et al.
We present the acceleration of an IMplicit-EXplicit (IMEX) non-hydrostatic atmospheric model on manycore processors such as GPUs and Intel's MIC architecture. IMEX time integration methods sidestep the constraint imposed by the Courant-Friedrichs-Lewy condition on explicit methods through corrective implicit solves within each time step. In this work, we implement and evaluate the performance of IMEX on manycore processors relative to explicit methods. Using 3D-IMEX at Courant number C=15 , we obtained a speedup of about 4X relative to an explicit time stepping method run with the maximum allowable C=1. In addition, we demonstrate a much larger speedup of 100X at C=150 using 1D-IMEX due to the unconditional stability of the method in the vertical direction. Several improvements on the IMEX procedure were necessary in order to outperform our results with explicit methods: a) reducing the number of degrees of freedom of the IMEX formulation by forming the Schur complement; b) formulating a horizontally-explicit vertically-implicit (HEVI) 1D-IMEX scheme that has a lower workload and potentially better scalability than 3D-IMEX; c) using high-order polynomial preconditioners to reduce the condition number of the resulting system; d) using a direct solver for the 1D-IMEX method by performing and storing LU factorizations once to obtain a constant cost for any Courant number. Without all of these improvements, explicit time integration methods turned out to be difficult to beat. We discuss in detail the IMEX infrastructure required for formulating and implementing efficient methods on manycore processors. Finally, we validate our results with standard benchmark problems in NWP and evaluate the performance and scalability of the IMEX method using up to 4192 GPUs and 16 Knights Landing processors.
CENov 7, 2017
IMEX HDG-DG: a coupled implicit hybridized discontinuous Galerkin (HDG) and explicit discontinuous Galerkin (DG) approach for shallow water systemsShinhoo Kang, Francis X. Giraldo, Tan Bui-Thanh
We propose IMEX HDG-DG schemes for planar and spherical shallow water systems. Of interest is subcritical flow, where the speed of the gravity wave is faster than that of nonlinear advection. In order to simulate these flows efficiently, we split the governing system into a stiff part describing the gravity wave and a non-stiff part associated with nonlinear advection. The former is discretized implicitly with the HDG method while an explicit Runge-Kutta DG discretization is employed for the latter. The proposed IMEX HDG-DG framework: 1) facilitates high-order solutions both in time and space; 2) avoids overly small time-step sizes; 3) requires only one linear system solve per time stage; 4) relative to DG generates smaller and sparser linear systems while promoting further parallelism. Numerical results of various test cases demonstrate that our methods are comparable to explicit Runge-Kutta DG schemes in terms of accuracy while allowing for much larger time step sizes.
NAJul 15, 2016
A continuous/discontinuous Galerkin solution of the shallow water equations with dynamic viscosity, high-order wetting and drying, and implicit time integrationSimone Marras, Michal A. Kopera, Emil M. Constantinescu et al.
The high-order numerical solution of the non-linear shallow water equations (and of hyperbolic systems in general) is susceptible to unphysical Gibbs oscillations that form in the proximity of strong gradients. The solution to this problem is still an active field of research as no general cure has been found yet. In this paper, we tackle this issue by presenting a dynamically adaptive viscosity based on a residual-based sub-grid scale model that has the following properties: $(i)$ it removes the spurious oscillations in the proximity of strong wave fronts while preserving the overall accuracy and sharpness of the solution. This is possible because of the residual-based definition of the dynamic diffusion coefficient. $(ii)$ For coarse grids, it prevents energy from building up at small wave-numbers. $(iii)$ The model has no tunable parameter. Our interest in the shallow water equations is tied to the simulation of coastal inundation, where a careful handling of the transition between dry and wet surfaces is particularly challenging for high-order Galerkin approximations. In this paper, we extend to a unified continuous/discontinuous Galerkin (CG/DG) framework a very simple, yet effective wetting and drying algorithm originally designed for DG [Xing, Zhang, Shu (2010)]. We show its effectiveness for problems in one and two dimensions on domains of increasing characteristic lengths varying from centimeters to kilometers. Finally, to overcome the time-step restriction incurred by the high-order Galerkin approximation, we advance the equations forward in time via a three stage, second order explicit-first-stage, singly diagonally implicit Runge-Kutta (ESDIRK) time integration scheme. Via ESDIRK, we are able to preserve numerical stability for an advective CFL number 10 times larger than its explicit counterpart.
2.9NAMay 15
GPU Performance of an Entropy-Stable Discontinuous Galerkin Euler Solver with Non-Conservative TermsHenry Waterhouse, Maciej Waruszewski, Lucas C. Wilcox et al.
The entropy-stable discontinuous Galerkin method for compressible Euler equations with buoyancy is implemented on graphics processing unit (GPU) hardware. We measure the performance of the solver on three-dimensional problems: the rising thermal bubble and the baroclinic instability in a channel. On NVIDIA A100 hardware, the solver achieves nearly 70\% of 64-bit floating-point peak performance for the most computationally expensive kernel (volume terms) and significantly reduces the computational overhead typically incurred by two point entropy-stable fluxes in the volume terms. We also present impressive strong and weak scaling performance of the solver and compare to a highly-optimized central processing unit (CPU) code showing that the GPU kernels are a factor of $10\times$ faster and better than $13\times$ more energy efficient than the CPU code. We also show that the solver achieves the expected $2\times$ speedup when run at 32-bit floating-point peak performance. We discuss the different modifications that we implemented to reach the final form of the GPU implementation and measure the performance gain of each of the implementation strategies ranging from reduction in complex operations and memory traffic as well as load balancing. We also extend symmetry-based flux savings to the non-symmetric gravity term, preserving nearly the full factor-of-two speedup achieved for the symmetric flux.
DCNov 5, 2015
Strong Scaling for Numerical Weather Prediction at Petascale with the Atmospheric Model NUMAAndreas Müller, Michal A. Kopera, Simone Marras et al.
Numerical weather prediction (NWP) has proven to be computationally challenging due to its inherent multiscale nature. Currently, the highest resolution NWP models use a horizontal resolution of about 10km. In order to increase the resolution of NWP models highly scalable atmospheric models are needed. The Non-hydrostatic Unified Model of the Atmosphere (NUMA), developed by the authors at the Naval Postgraduate School, was designed to achieve this purpose. NUMA is used by the Naval Research Laboratory, Monterey as the engine inside its next generation weather prediction system NEPTUNE. NUMA solves the fully compressible Navier-Stokes equations by means of high-order Galerkin methods (both spectral element as well as discontinuous Galerkin methods can be used). Mesh generation is done using the p4est library. NUMA is capable of running middle and upper atmosphere simulations since it does not make use of the shallow-atmosphere approximation. This paper presents the performance analysis and optimization of the spectral element version of NUMA. The performance at different optimization stages is analyzed using a theoretical performance model as well as measurements via hardware counters. Machine independent optimization is compared to machine specific optimization using BG/Q vector intrinsics. By using vector intrinsics the main computations reach 1.2 PFlops on the entire machine Mira (12% of the theoretical peak performance). The paper also presents scalability studies for two idealized test cases that are relevant for NWP applications. The atmospheric model NUMA delivers an excellent strong scaling efficiency of 99% on the entire supercomputer Mira using a mesh with 1.8 billion grid points. This allows to run a global forecast of a baroclinic wave test case at 3km uniform horizontal resolution and double precision within the time frame required for operational weather prediction.