69.4 DC May 29
HeLoCo: Efficient asynchronous low-communication training under data and device heterogeneity
Abdullah Al Asif, Patrick Diem, Juan Pablo Muñoz et al.
Distributed Low-Communication (DiLoCo) training reduces communication overhead by allowing workers to perform multiple local optimization steps before sending pseudo-gradients to a global outer update. Its asynchronous variant further improves hardware utilization by removing synchronization barriers, but at the cost of stale pseudo-gradients computed from outdated model states. As a result, these updates can become misaligned with the current global optimization direction, particularly in heterogeneous systems. This issue becomes even more pronounced when data are non-IID, a setting that has not been well studied in asynchronous low-communication training. To address this limitation, we propose HeLoCo, a direction-aware correction method for asynchronous low-communication training that uses outer momentum as a reference for the current optimization trajectory and selectively adjusts incoming pseudo-gradients before the outer update. Updates that remain aligned are preserved, while directionally conflicting components are corrected. On multilingual language-model training with heterogeneous workers and non-IID data, HeLoCo consistently improves validation loss. It outperforms existing asynchronous DiLoCo-based baselines by up to 7.5% at a fixed token budget, exceeds asynchronous momentum look-ahead by up to 3.3% at a fixed wall-clock budget, and surpasses the synchronous baseline by up to 22.1% under severe system heterogeneity. Our analysis further shows how staleness, worker speed, and data heterogeneity shape update quality and convergence in highly decentralized and heterogeneous training setups.
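The abstract does not spell out the correction rule, so the following is only a sketch of one plausible form of a direction-aware correction: a projection step that removes the component of a stale pseudo-gradient pointing against the outer momentum. The function name `correct_pseudo_gradient`, the projection rule, and the assumption that pseudo-gradient and momentum share the same sign convention are all hypothetical and not taken from the paper.

```python
def dot(a, b):
    """Inner product of two equal-length vectors given as lists of floats."""
    return sum(x * y for x, y in zip(a, b))


def correct_pseudo_gradient(pseudo_grad, outer_momentum, eps=1e-12):
    """Hypothetical direction-aware correction (projection-style assumption).

    If the incoming pseudo-gradient is aligned with the outer momentum
    (non-negative inner product), it is returned unchanged. Otherwise the
    component that points against the momentum is projected out.
    """
    m_norm_sq = dot(outer_momentum, outer_momentum)
    if m_norm_sq < eps:
        # No usable reference direction yet (e.g. first outer step).
        return list(pseudo_grad)

    alignment = dot(pseudo_grad, outer_momentum)
    if alignment >= 0.0:
        # Aligned update: preserve it as-is.
        return list(pseudo_grad)

    # Conflicting update: subtract the projection onto the momentum.
    scale = alignment / m_norm_sq
    return [g - scale * m for g, m in zip(pseudo_grad, outer_momentum)]


momentum = [1.0, 0.0]
aligned = correct_pseudo_gradient([1.0, 2.0], momentum)    # left untouched
stale = correct_pseudo_gradient([-1.0, 2.0], momentum)     # conflict removed
```

In this toy case the stale update keeps its component orthogonal to the momentum and loses only the part that opposes it. The actual method may weight or rescale the correction differently.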
67.3 DC Jun 2
I Like To Move It -- Computation Instead of Data in the Brain
Fabian Czappa, Marvin Kaster, Felix Wolf
The detailed functioning of the human brain remains incompletely understood. Large-scale brain simulations complement experimental research but face substantial computational challenges: the human brain comprises approximately $10^{11}$ neurons connected by $10^{14}$ synapses, collectively forming the connectome. Empirical evidence indicates that modifications of the connectome -- specifically the formation and elimination of synapses, referred to as structural plasticity -- are essential for processes such as learning and memory formation. Connectivity updates can be computed efficiently using a Barnes--Hut-inspired approximation that reduces computational complexity from $O(n^2)$ to $O(n \log n)$, where $n$ denotes the number of neurons. Despite this improvement, communication overhead still limits scalability. Synapse updates rely heavily on remote memory access (RMA), and spike transmission requires all-to-all communication at every simulation time step. We introduce a novel algorithm that reduces communication by migrating computation rather than data. This approach reduces connectivity update time by a factor of 6 and spike transmission time by more than 2 orders of magnitude.
NA Oct 13, 2017
A Fast Isogeometric BEM for the Three Dimensional Laplace- and Helmholtz Problems
Jürgen Dölz, Helmut Harbrecht, Stefan Kurz et al.
We present an indirect higher order boundary element method utilising NURBS mappings for exact geometry representation and an interpolation-based fast multipole method for compression and reduction of computational complexity, to counteract the problems arising due to the dense matrices produced by boundary element methods. By solving Laplace and Helmholtz problems via a single layer approach we show, through a series of numerical examples suitable for easy comparison with other numerical schemes, that one can indeed achieve extremely high rates of convergence of the pointwise potential through the utilisation of higher order B-spline-based ansatz functions.
CE Sep 18, 2017
Recent Advances of Isogeometric Analysis in Computational Electromagnetics
Zeger Bontinck, Jacopo Corno, Herbert De Gersem et al.
In this communication, the advantages and drawbacks of isogeometric analysis (IGA) are reviewed in the context of electromagnetic simulations. IGA extends the set of polynomial basis functions commonly employed by the classical Finite Element Method (FEM). While identical to FEM with Nédélec's basis functions in the lowest-order case, it is based on B-spline and Non-Uniform Rational B-spline basis functions. The main benefit of this is the exact representation of the geometry in the language of computer-aided design (CAD) tools. This simplifies the meshing, as the computational mesh is implicitly created by the engineer using the CAD tool. The curl- and div-conforming spline function spaces are recapitulated and the available software is discussed. Finally, several non-academic benchmark examples in two and three dimensions are shown, which are used in optimization and uncertainty quantification workflows.
CE Aug 9, 2024
A Low-Frequency-Stable Higher-Order Isogeometric Discretization of the Augmented Electric Field Integral Equation
Maximilian Nolte, Riccardo Torchio, Sebastian Schöps et al.
This contribution investigates the connection between isogeometric analysis and integral equation methods for full-wave electromagnetic problems up to the low-frequency limit. The proposed spline-based integral equation method allows for an exact representation of the model geometry described in terms of non-uniform rational B-splines without meshing. This is particularly useful when high accuracy is required or when meshing is cumbersome, for instance during the optimization of electric components. The augmented electric field integral equation is adopted and the deflation method is applied, so that the low-frequency breakdown is avoided. The extension to higher-order basis functions is analyzed and the convergence rate is discussed. Numerical experiments on academic and realistic test cases demonstrate the high accuracy of the proposed approach.
7.1 DC Apr 1
Navigating the Energy Doldrums: Can We Exploit Energy-Price Volatility To Lower the Cost of Computing?
Peter Arzt, Felix Wolf
Energy costs are a major factor in the total cost of ownership (TCO) for high-performance computing (HPC) systems. The rise of intermittent green energy sources and reduced reliance on fossil fuels have introduced volatility into electricity markets, complicating energy budgeting. This paper explores variable capacity as a strategy for managing HPC energy costs -- dynamically adjusting compute resources in response to fluctuating electricity prices. While this approach can lower energy expenses, it risks underutilizing costly hardware. To evaluate this trade-off, we present a simple model that helps operators estimate the TCO impact of variable capacity strategies using key system parameters. We apply this model to real data from a university HPC cluster and assess how different scenarios could affect the cost-effectiveness of this approach in the future.
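The trade-off described above can be illustrated with a toy calculation. This is a minimal sketch under stated assumptions, not the model from the paper: the function `cost_per_compute_hour`, the price-cap policy (run only in hours whose price is at or below a cap), and all parameter names and numbers are hypothetical.

```python
def cost_per_compute_hour(prices, power_kw, fixed_cost_per_hour, price_cap):
    """Toy variable-capacity cost estimate (all parameters are assumptions).

    prices              -- electricity price per kWh for each hour
    power_kw            -- power draw of the system while running
    fixed_cost_per_hour -- amortized hardware and operations cost, paid
                           every hour whether or not the system runs
    price_cap           -- the system runs only in hours with price <= cap

    Returns (total cost per productive hour, utilization).
    """
    running = [p for p in prices if p <= price_cap]
    if not running:
        return float("inf"), 0.0

    energy_cost = sum(p * power_kw for p in running)
    fixed_cost = fixed_cost_per_hour * len(prices)
    return (energy_cost + fixed_cost) / len(running), len(running) / len(prices)


prices = [0.1, 0.1, 0.5, 0.5]  # hypothetical prices per kWh

# Cheap hardware: skipping expensive hours lowers the cost per compute hour.
always_on_cheap = cost_per_compute_hour(prices, 100, 10, float("inf"))
capped_cheap = cost_per_compute_hour(prices, 100, 10, 0.2)

# Expensive hardware: idle capital outweighs the energy savings.
always_on_costly = cost_per_compute_hour(prices, 100, 40, float("inf"))
capped_costly = cost_per_compute_hour(prices, 100, 40, 0.2)
```

With these made-up numbers, capping helps when fixed costs are low and hurts when they are high, which is the underutilization risk the abstract points to.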
AI Nov 27, 2025
When AI Bends Metal: AI-Assisted Optimization of Design Parameters in Sheet Metal Forming
Ahmad Tarraf, Koutaiba Kassem-Manthey, Seyed Ali Mohammadi et al.
Numerical simulations have revolutionized the industrial design process by reducing prototyping costs and design iterations and by enabling product engineers to explore the design space more efficiently. However, the growing scale of simulations demands substantial expert knowledge, computational resources, and time. A key challenge is identifying input parameters that yield optimal results, as iterative simulations are costly and can have a large environmental impact. This paper presents an AI-assisted workflow that reduces expert involvement in parameter optimization through the use of Bayesian optimization. Furthermore, we present an active learning variant of the approach that assists the expert if desired. A deep learning model provides an initial parameter estimate, from which the optimization cycle iteratively refines the design until a termination condition (e.g., energy budget or iteration limit) is met. We demonstrate our approach on a sheet metal forming process and show how it enables us to accelerate the exploration of the design space while reducing the need for expert involvement.
PL Feb 24, 2021
Learning to Make Compiler Optimizations More Effective
Rahim Mammadli, Marija Selakovic, Felix Wolf et al.
Because loops execute their body many times, compiler developers place much emphasis on their optimization. Nevertheless, in view of highly diverse source code and hardware, compilers still struggle to produce optimal target code. The sheer number of possible loop optimizations, including their combinations, exacerbates the problem further. Today's compilers use hard-coded heuristics to decide when, whether, and which of a limited set of optimizations to apply. Often, this leads to highly unstable behavior, making the success of compiler optimizations dependent on the precise way a loop has been written. This paper presents LoopLearner, which addresses the problem of compiler instability by predicting which way of writing a loop will lead to efficient compiled code. To this end, we train a neural network to find semantically invariant source-level transformations for loops that help the compiler generate more efficient code. Our model learns to extract useful features from the raw source code and predicts the speedup that a given transformation is likely to yield. We evaluate LoopLearner with 1,895 loops from various performance-relevant benchmarks. Applying the transformations that our model deems most favorable prior to compilation yields an average speedup of 1.14x. When trying the top-3 suggested transformations, the average speedup even increases to 1.29x. Comparing the approach with an exhaustive search through all available code transformations shows that LoopLearner helps to identify the most beneficial transformations in several orders of magnitude less time.
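The top-1 versus top-3 evaluation mentioned above can be pictured as a small selection routine. This is only an illustrative sketch: `pick_best_of_top_k`, the transformation names, and the speedup values are hypothetical and are not LoopLearner's API or data.

```python
def pick_best_of_top_k(predicted, measure, k=3):
    """Try the k transformations with the highest predicted speedup.

    predicted -- dict mapping a transformation name to its predicted speedup
    measure   -- callable returning the measured speedup for a name
    Returns (best transformation among those tried, its measured speedup).
    """
    ranked = sorted(predicted, key=predicted.get, reverse=True)[:k]
    measured = {name: measure(name) for name in ranked}
    best = max(measured, key=measured.get)
    return best, measured[best]


# Hypothetical predictions and ground-truth measurements for one loop.
predicted = {"unroll": 1.20, "interchange": 1.10, "tile": 1.05, "fuse": 0.90}
actual = {"unroll": 0.95, "interchange": 1.30, "tile": 1.00, "fuse": 1.50}

top1 = pick_best_of_top_k(predicted, actual.get, k=1)  # trusts the model fully
top3 = pick_best_of_top_k(predicted, actual.get, k=3)  # measures three options
```

Measuring a few of the highest-ranked candidates hedges against a misprediction at rank one while still avoiding an exhaustive search over all transformations.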
LG Aug 20, 2020
Static Neural Compiler Optimization via Deep Reinforcement Learning
Rahim Mammadli, Ali Jannesari, Felix Wolf
The phase-ordering problem of modern compilers has received a lot of attention from the research community over the years, yet remains largely unsolved. Various optimization sequences exposed to the user are manually designed by compiler developers. In designing such a sequence, developers have to choose the set of optimization passes, their parameters, and their ordering within the sequence. The resulting sequences usually fall short of achieving optimal runtime for a given source code and may sometimes even degrade performance compared to the unoptimized version. In this paper, we employ a deep reinforcement learning approach to the phase-ordering problem. Provided with sub-sequences constituting LLVM's O3 sequence, our agent learns to outperform the O3 sequence on the set of source codes used for training and achieves competitive performance on the validation set, gaining up to 1.32x speedup on previously unseen programs. Notably, our approach differs from autotuning methods by not depending on one or more test runs of the program for making successful optimization decisions. It has no dependence on any dynamic feature, but only on the statically attainable intermediate representation of the source code. We believe that the models trained using our approach can be integrated into modern compilers as neural optimization agents, at first to complement, and eventually to replace, the hand-crafted optimization sequences.