Harald Köstler

CE
h-index: 33
13 papers
73 citations
Novelty: 46%
AI Score: 51

13 Papers

DC May 31
Move the Query, Not the Cache: Characterizing Cross-Instance Latent Attention Redistribution Across GPU Fabrics

Bole Ma, Jan Eitzinger, Harald Köstler et al.

Frontier LLMs increasingly decide what a query attends to with a sparse-attention indexer that picks a few KV-cache blocks per query: attention's unit is now a small, reusable chunk. Agentic workloads hammer it: many sub-agents query one large codebase, reusing the same blocks. When that corpus outgrows one GPU it is partitioned across instances, so a query and the blocks it selects often sit on different GPUs: answering it means attention across instances. The reflex of prior cross-instance KV systems is to move the cache: pull the selected blocks to the requester. Multi-head Latent Attention inverts the arithmetic, compressing each token's key and value into one narrow vector, so a routed query row is only ~1 KB, smaller than the chunk it attends; routing the query is then often cheaper than moving the cache. Which primitive wins, over which fabric and request shape, is uncharted, least of all on device-initiated RDMA that makes per-request cross-node transfers cheap. We characterize cross-instance MLA attention on a real multi-node H100 cluster, distilling two reusable artifacts: a topology-aware cost model (probe / transfer / compute / return / merge) and a closed-form route/fetch/local predicate, whose constants we measure on real IBGDA, where the model tracks batched round-trips to within ~7%. At decode it routes the query, trading the cost of moving the cache (a ~3 ms re-adaptation splice for a contiguous chunk, or a scattered gather under selection) for a tens-of-microsecond round trip, and picks the fabric by probe latency, not peak bandwidth. We instantiate the cost model and predicate for MLA, but neither is MLA-specific: they apply wherever compression or sparse selection shrinks attention to small chunks (DeepSeek-V3.2, V4, and GLM-5.1 today). Extending them to a new architecture requires measuring just two coefficients: the routed payload and fetch's move-the-cache cost.
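The route-versus-fetch trade-off described above can be illustrated with a small sketch. This is not the paper's closed-form predicate: the `Fabric` fields, the latency-plus-bandwidth transfer model, and every number in the example are hypothetical. The attention compute term is assumed equal for routing and fetching, so it is left out.

```python
from dataclasses import dataclass


@dataclass
class Fabric:
    """Hypothetical link description; both fields are illustrative."""
    probe_latency_s: float  # fixed per-message latency in seconds
    bandwidth_Bps: float    # sustained bandwidth in bytes per second


def transfer_time(fabric: Fabric, nbytes: int) -> float:
    """Simple latency + size/bandwidth model for one message."""
    return fabric.probe_latency_s + nbytes / fabric.bandwidth_Bps


def choose_primitive(fabric: Fabric, query_bytes: int, result_bytes: int,
                     cache_bytes: int, blocks_are_local: bool,
                     merge_s: float = 0.0) -> str:
    """Return 'local', 'route', or 'fetch' under the assumed cost model."""
    if blocks_are_local:
        return "local"
    # Route: send the query row, get the partial result back, merge it.
    route = (transfer_time(fabric, query_bytes)
             + transfer_time(fabric, result_bytes) + merge_s)
    # Fetch: pull the selected cache blocks to the requester.
    fetch = transfer_time(fabric, cache_bytes)
    return "route" if route <= fetch else "fetch"


# Hypothetical fabric: 5 microsecond latency, 50 GB/s bandwidth.
fabric = Fabric(probe_latency_s=5e-6, bandwidth_Bps=50e9)
large_chunk = choose_primitive(fabric, 1024, 1024, 4 * 1024 * 1024, False)
tiny_chunk = choose_primitive(fabric, 1024, 1024, 512, False)
print(large_chunk, tiny_chunk)
```

Under these made-up constants, routing wins when the selected cache is large compared with the ~1 KB query row, and fetching wins only when the cache to move is smaller than the two-message round trip.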

CE Jun 16, 2023
AI Driven Near Real-time Locational Marginal Pricing Method: A Feasibility and Robustness Study

Naga Venkata Sai Jitin Jami, Juraj Kardoš, Olaf Schenk et al.

Accurate price predictions are essential for market participants in order to optimize their operational schedules and bidding strategies, especially in the current context where electricity prices become more volatile and less predictable using classical approaches. The Locational Marginal Pricing (LMP) mechanism is used in many modern power markets, where the traditional approach utilizes optimal power flow (OPF) solvers. However, for large electricity grids this process becomes prohibitively time-consuming and computationally intensive. Machine learning (ML) based predictions could provide an efficient tool for LMP prediction, especially in energy markets with intermittent sources like renewable energy. This study evaluates the performance of popular machine learning and deep learning models in predicting LMP on multiple electricity grids. The accuracy and robustness of these models in predicting LMP are assessed considering multiple scenarios. The results show that ML models can predict LMP 4-5 orders of magnitude faster than traditional OPF solvers with a 5-6% error rate, highlighting the potential of ML models in LMP prediction for large-scale power models with the assistance of hardware infrastructure like multi-core CPUs and GPUs in modern HPC clusters.

DC May 7
Irminsul: MLA-Native Position-Independent Caching for Agentic LLM Serving

Bole Ma, Jan Eitzinger, Harald Köstler

Agentic LLM workloads put bit-identical tokens at shifted positions every turn, voiding prefix caches at the first byte of divergence. Operators report cache-hit regressions ranging from moderate slowdowns to severe TTFT spikes of 10-16s on unchanged content. Prior position-independent caching systems correct RoPE on the full $d_K$-dimensional key, an architectural cost imposed by GQA, not by caching itself. Multi-Head Latent Attention, deployed at scale in DeepSeek-V2/V3/R1, Kimi-K2/Moonlight, GLM-5, and Mistral Large 3, factors each KV row into a position-free $c_{KV}$ and a 64-dim $k_r$ correctable in closed form; this structure motivates content-addressed caching as a natural fit rather than a GQA workaround. We present Irminsul, which extends SGLang's radix cache with content-hash keying over CDC-chunked segments and a $δ$-rotation rule for $k_r$. We evaluate three native MLA-MoE deployments - DeepSeek-V2-Lite (16B/2.4B), Kimi Moonlight-16B-A3B, and JoyAI-Flash (48B/3B) - with output-consistency on all three and recovery measured on the two endpoints; Irminsul recovers up to ~83% of prompt tokens above exact-prefix on agentic traffic while delivering 63% prefill energy savings per cache hit. We argue that content-addressed caching belongs in the serving stack as a first-class primitive, not a retrofit over prefix matching.
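The closed-form correction mentioned above rests on a general property of rotary position embeddings: rotations compose, so a key already rotated for position p can be moved to position p + δ by applying one more rotation by δ. The sketch below demonstrates that property only. It is not Irminsul's implementation, and the adjacent-pair layout, the base of 10000, and the function names are assumptions chosen for illustration.

```python
import math


def rope(vec, pos, base=10000.0):
    """Apply a rotary embedding for position `pos` to an even-length vector.

    Assumes adjacent elements (0,1), (2,3), ... form the rotated pairs;
    real implementations may pair elements differently.
    """
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # per-pair rotation angle
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def shift_cached_key(rotated_key, delta, base=10000.0):
    """Move an already-rotated key by `delta` positions with one rotation."""
    return rope(rotated_key, delta, base)


# A deterministic 64-dim key, matching the 64-dim k_r mentioned above.
key = [math.sin(i) for i in range(64)]
cached = rope(key, 100)               # key as cached at position 100
shifted = shift_cached_key(cached, 37)  # reuse it at position 137
direct = rope(key, 137)               # recomputed from scratch
max_err = max(abs(a - b) for a, b in zip(shifted, direct))
print(max_err)
```

The shifted key agrees with the recomputed one up to floating-point rounding, which is why only the small positional part of a cached row needs correcting when content reappears at a different offset.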

CL Feb 11
SteuerLLM: Local specialized large language model for German tax law analysis

Sebastian Wind, Jeta Sopa, Laurin Schmid et al.

Large language models (LLMs) demonstrate strong general reasoning and language understanding, yet their performance degrades in domains governed by strict formal rules, precise terminology, and legally binding structure. Tax law exemplifies these challenges, as correct answers require exact statutory citation, structured legal argumentation, and numerical accuracy under rigid grading schemes. We algorithmically generate SteuerEx, the first open benchmark derived from authentic German university tax law examinations. SteuerEx comprises 115 expert-validated examination questions spanning six core tax law domains and multiple academic levels, and employs a statement-level, partial-credit evaluation framework that closely mirrors real examination practice. We further present SteuerLLM, a domain-adapted LLM for German tax law trained on a large-scale synthetic dataset generated from authentic examination material using a controlled retrieval-augmented pipeline. SteuerLLM (28B parameters) consistently outperforms general-purpose instruction-tuned models of comparable size and, in several cases, substantially larger systems, demonstrating that domain-specific data and architectural adaptation are more decisive than parameter scale for performance on realistic legal reasoning tasks. All benchmark data, training datasets, model weights, and evaluation code are released openly to support reproducible research in domain-specific legal artificial intelligence. A web-based demo of SteuerLLM is available at https://steuerllm.i5.ai.fau.de.

NA Apr 27, 2022
Evolving Generalizable Multigrid-Based Helmholtz Preconditioners with Grammar-Guided Genetic Programming

Jonas Schmitt, Harald Köstler

Solving the indefinite Helmholtz equation is not only crucial for the understanding of many physical phenomena but also represents an outstandingly difficult benchmark problem for the successful application of numerical methods. Here we introduce a new approach for evolving efficient preconditioned iterative solvers for Helmholtz problems with multi-objective grammar-guided genetic programming. Our approach is based on a novel context-free grammar, which enables the construction of multigrid preconditioners that employ a tailored sequence of operations on each discretization level. To find solvers that generalize well over the given domain, we propose a custom method of successive problem difficulty adaptation, in which we evaluate a preconditioner's efficiency on increasingly ill-conditioned problem instances. We demonstrate our approach's effectiveness by evolving multigrid-based preconditioners for a two-dimensional indefinite Helmholtz problem that outperform several human-designed methods for different wavenumbers up to systems of linear equations with more than a million unknowns.

CL May 5
Safety and accuracy follow different scaling laws in clinical large language models

Sebastian Wind, Tri-Thien Nguyen, Jeta Sopa et al.

Clinical LLMs are often scaled by increasing model size, context length, retrieval complexity, or inference-time compute, with the implicit expectation that higher accuracy implies safer behavior. This assumption is incomplete in medicine, where a few confident, high-risk, or evidence-contradicting errors can matter more than average benchmark performance. We introduce SaFE-Scale, a framework for measuring how clinical LLM safety changes across model scale, evidence quality, retrieval strategy, context exposure, and inference-time compute. To instantiate this framework, we introduce RadSaFE-200, a Radiology Safety-Focused Evaluation benchmark of 200 multiple-choice questions with clinician-defined clean evidence, conflict evidence, and option-level labels for high-risk error, unsafe answer, and evidence contradiction. We evaluated 34 locally deployed LLMs across six deployment conditions: closed-book prompting (zero-shot), clean evidence, conflict evidence, standard RAG, agentic RAG, and max-context prompting. Clean evidence produced the strongest improvement, increasing mean accuracy from 73.5% to 94.1%, while reducing high-risk error from 12.0% to 2.6%, contradiction from 12.7% to 2.3%, and dangerous overconfidence from 8.0% to 1.6%. Standard RAG and agentic RAG did not reproduce this safety profile: agentic RAG improved accuracy over standard RAG and reduced contradiction, but high-risk error and dangerous overconfidence remained elevated. Max-context prompting increased latency without closing the safety gap, and additional inference-time compute produced only limited gains. Worst-case analysis showed that clinically consequential errors concentrated in a small subset of questions. Clinical LLM safety is therefore not a passive consequence of scaling, but a deployment property shaped by evidence quality, retrieval design, context construction, and collective failure behavior.

LG Apr 9, 2025
Benchmarking Convolutional Neural Network and Graph Neural Network based Surrogate Models on a Real-World Car External Aerodynamics Dataset

Sam Jacob Jacob, Markus Mrosek, Carsten Othmer et al.

Aerodynamic optimization is crucial for developing eco-friendly, aerodynamic, and stylish cars, which requires close collaboration between aerodynamicists and stylists, a collaboration impaired by the time-consuming nature of aerodynamic simulations. Surrogate models offer a viable solution to reduce this overhead, but they are untested on real-world aerodynamic datasets. We present a comparative evaluation of two surrogate modeling approaches for predicting drag on a real-world dataset: a Convolutional Neural Network (CNN) model that uses a signed distance field as input and a commercial tool based on Graph Neural Networks (GNN) that directly processes a surface mesh. In contrast to previous studies based on datasets created from parameterized geometries, our dataset comprises 343 geometries derived from 32 baseline vehicle geometries across five distinct car projects, reflecting the diverse, free-form modifications encountered in the typical vehicle development process. Our results show that the CNN-based method achieves a mean absolute error of 2.3 drag counts, while the GNN-based method achieves 3.8. Both methods achieve approximately 77% accuracy in predicting the direction of drag change relative to the baseline geometry. While both methods effectively capture the broader trends between baseline groups (set of samples derived from a single baseline geometry), they struggle to varying extents in capturing the finer intra-baseline group variations. In summary, our findings suggest that aerodynamicists can effectively use both methods to predict drag in under two minutes, which is at least 600 times faster than performing a simulation. However, there remains room for improvement in capturing the finer details of the geometry.
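For readers unfamiliar with the two metrics quoted above, here is a minimal sketch of how they could be computed. It assumes the usual convention that one drag count equals 0.001 in drag coefficient. The function names and sample values are invented for illustration and are not taken from the paper's dataset.

```python
def mae_drag_counts(pred_cd, true_cd):
    """Mean absolute error in drag counts (1 count = 0.001 in c_d)."""
    errors = [abs(p - t) for p, t in zip(pred_cd, true_cd)]
    return sum(errors) / len(errors) * 1000.0


def direction_accuracy(pred_cd, true_cd, baseline_cd):
    """Fraction of samples whose predicted drag change relative to the
    baseline has the same sign as the true change."""
    hits = sum(
        ((p - b) > 0) == ((t - b) > 0)
        for p, t, b in zip(pred_cd, true_cd, baseline_cd)
    )
    return hits / len(true_cd)


# Invented values: four variants of one baseline with c_d = 0.300.
baseline = [0.300, 0.300, 0.300, 0.300]
true_cd = [0.305, 0.295, 0.310, 0.298]
pred_cd = [0.303, 0.297, 0.299, 0.296]

mae = mae_drag_counts(pred_cd, true_cd)
acc = direction_accuracy(pred_cd, true_cd, baseline)
print(round(mae, 2), acc)
```

In this toy case the third variant is predicted to lower drag although it raises it, so the direction accuracy is 3 out of 4 even though the average error is only a few counts.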

CE Dec 8, 2024
Evolving Algebraic Multigrid Methods Using Grammar-Guided Genetic Programming

Dinesh Parthasarathy, Wayne Bradford Mitchell, Harald Köstler

Multigrid methods, despite being known to be asymptotically optimal algorithms, depend on the careful selection of their individual components for efficiency. Also, they are mostly restricted to standard cycle types like V-, F-, and W-cycles. We use grammar rules to generate arbitrary-shaped cycles, wherein the smoothers and their relaxation weights are chosen independently at each step within the cycle. We call this a flexible multigrid cycle. These flexible cycles are used in Algebraic Multigrid (AMG) methods with the help of grammar rules and optimized using genetic programming. The flexible AMG methods are implemented in the software library of hypre, and the programs are optimized separately for two cases: a standalone AMG solver for a 3D anisotropic problem and an AMG preconditioner with conjugate gradient for a multiphysics code. We observe that the optimized flexible cycles provide higher efficiency and better performance than the standard cycle types.

LG Sep 21, 2025
PMRT: A Training Recipe for Fast, 3D High-Resolution Aerodynamic Prediction

Sam Jacob Jacob, Markus Mrosek, Carsten Othmer et al.

The aerodynamic optimization of cars requires close collaboration between aerodynamicists and stylists, while slow, expensive simulations remain a bottleneck. Surrogate models have been shown to accurately predict aerodynamics within the design space for which they were trained. However, many of these models struggle to scale to higher resolutions because of the 3D nature of the problem and data scarcity. We propose Progressive Multi-Resolution Training (PMRT), a probabilistic multi-resolution training schedule that enables training a U-Net to predict the drag coefficient ($c_d$) and high-resolution velocity fields (512 x 128 x 128) in 24 hours on a single NVIDIA H100 GPU, 7x cheaper than the high-resolution-only baseline, with similar accuracy. PMRT samples batches from three resolutions based on probabilities that change during training, starting with an emphasis on lower resolutions and gradually shifting toward higher resolutions. Since this is a training methodology, it can be adapted to other high-resolution-focused backbones. We also show that a single model can be trained across five datasets from different solvers, including a real-world dataset, by conditioning on the simulation parameters. In the DrivAerML dataset, our models achieve a $c_d$ $R^2$ of 0.975, matching literature baselines at a fraction of the training cost.
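The sampling idea described above, drawing each batch from one of three resolutions with probabilities that drift from low to high resolution, can be sketched as follows. The abstract does not give the actual schedule, so the linear interpolation, the start and end probabilities, and all names here are assumptions made for illustration.

```python
import random

RESOLUTIONS = ["low", "mid", "high"]
START_PROBS = (0.7, 0.2, 0.1)  # hypothetical: favour low resolution early
END_PROBS = (0.1, 0.2, 0.7)    # hypothetical: favour high resolution late


def resolution_probs(step, total_steps):
    """Linearly interpolate sampling probabilities over training."""
    t = min(max(step / total_steps, 0.0), 1.0)
    return tuple((1 - t) * a + t * b for a, b in zip(START_PROBS, END_PROBS))


def sample_resolution(step, total_steps, rng):
    """Pick the resolution for the batch at `step`."""
    weights = resolution_probs(step, total_steps)
    return rng.choices(RESOLUTIONS, weights=weights, k=1)[0]


rng = random.Random(0)  # seeded for reproducibility
total = 1000
early = [sample_resolution(s, total, rng) for s in range(0, 100)]
late = [sample_resolution(s, total, rng) for s in range(900, 1000)]
print(early.count("high"), late.count("high"))
```

Because most early batches are low resolution and therefore cheap, a schedule of this shape spends the expensive high-resolution steps late in training, which is the intuition behind the cost saving the abstract reports.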

CL Aug 1, 2025
Agentic large language models improve retrieval-based radiology question answering

Sebastian Wind, Jeta Sopa, Daniel Truhn et al.

Clinical decision-making in radiology increasingly benefits from artificial intelligence (AI), particularly through large language models (LLMs). However, traditional retrieval-augmented generation (RAG) systems for radiology question answering (QA) typically rely on single-step retrieval, limiting their ability to handle complex clinical reasoning tasks. Here we propose radiology Retrieval and Reasoning (RaR), a multi-step retrieval and reasoning framework designed to improve diagnostic accuracy, factual consistency, and clinical reliability of LLMs in radiology question answering. We evaluated 25 LLMs spanning diverse architectures, parameter scales (0.5B to >670B), and training paradigms (general-purpose, reasoning-optimized, clinically fine-tuned), using 104 expert-curated radiology questions from previously established RSNA-RadioQA and ExtendedQA datasets. To assess generalizability, we additionally tested on an unseen internal dataset of 65 real-world radiology board examination questions. RaR significantly improved mean diagnostic accuracy over zero-shot prompting and conventional online RAG. The greatest gains occurred in small-scale models, while very large models (>200B parameters) demonstrated minimal changes (<2% improvement). Additionally, RaR retrieval reduced hallucinations (mean 9.4%) and retrieved clinically relevant context in 46% of cases, substantially aiding factual grounding. Even clinically fine-tuned models showed gains from RaR (e.g., MedGemma-27B), indicating that retrieval remains beneficial despite embedded domain knowledge. These results highlight the potential of RaR to enhance factuality and diagnostic accuracy in radiology QA, warranting future studies to validate their clinical utility. All datasets, code, and the full RaR framework are publicly available to support open research and clinical translation.

CE Dec 11, 2024
Towards Automated Algebraic Multigrid Preconditioner Design Using Genetic Programming for Large-Scale Laser Beam Welding Simulations

Dinesh Parthasarathy, Tommaso Bevilacqua, Martin Lanser et al.

Multigrid methods are asymptotically optimal algorithms ideal for large-scale simulations. However, they require making numerous algorithmic choices that significantly influence their efficiency. Unlike recent approaches that learn optimal multigrid components using machine learning techniques, we adopt a complementary strategy here, employing evolutionary algorithms to construct efficient multigrid cycles from available individual components. This technology is applied to finite element simulations of the laser beam welding process. The thermo-elastic behavior is described by a coupled system of time-dependent thermo-elasticity equations, leading to nonlinear and ill-conditioned systems. The nonlinearity is addressed using Newton's method, and iterative solvers are accelerated with an algebraic multigrid (AMG) preconditioner using hypre BoomerAMG interfaced via PETSc. This is applied as a monolithic solver for the coupled equations. To further enhance solver efficiency, flexible AMG cycles are introduced, extending traditional cycle types with level-specific smoothing sequences and non-recursive cycling patterns. These are automatically generated using genetic programming, guided by a context-free grammar containing AMG rules. Numerical experiments demonstrate the potential of these approaches to improve solver performance in large-scale laser beam welding simulations.

LG Aug 10, 2021
Known Operator Learning and Hybrid Machine Learning in Medical Imaging -- A Review of the Past, the Present, and the Future

Andreas Maier, Harald Köstler, Marco Heisig et al.

In this article, we perform a review of the state-of-the-art of hybrid machine learning in medical imaging. We start with a short summary of the general developments of the past in machine learning and how general and specialized approaches have been in competition in the past decades. A particular focus will be the theoretical and experimental evidence pro and contra hybrid modelling. Next, we inspect several new developments regarding hybrid machine learning with a particular focus on so-called known operator learning and how hybrid approaches gain more and more momentum across essentially all applications in medical imaging and medical image analysis. As we will point out by numerous examples, hybrid models are taking over in image reconstruction and analysis. Even domains such as physical simulation and scanner and acquisition design are being addressed using machine learning grey box modelling approaches. Towards the end of the article, we will investigate a few future directions and point out relevant areas in which hybrid modelling, meta learning, and other domains will likely be able to drive the state-of-the-art ahead.

NA Oct 7, 2019
Optimizing Geometric Multigrid Methods with Evolutionary Computation

Jonas Schmitt, Sebastian Kuckuk, Harald Köstler

For many linear and nonlinear systems that arise from the discretization of partial differential equations the construction of an efficient multigrid solver is a challenging task. Here we present a novel approach for the optimization of geometric multigrid methods that is based on evolutionary computation, a generic program optimization technique inspired by the principle of natural evolution. A multigrid solver is represented as a tree of mathematical expressions which we generate based on a tailored grammar. The quality of each solver is evaluated in terms of convergence and compute performance using automated local Fourier analysis (LFA) and roofline performance modeling, respectively. Based on these objectives a multi-objective optimization is performed using strongly typed genetic programming with a non-dominated sorting based selection. To evaluate the model-based prediction and to target concrete applications, scalable implementations of an evolved solver can be automatically generated with the ExaStencils framework. We demonstrate our approach by constructing multigrid solvers for the steady-state heat equation with constant and variable coefficients that consistently perform better than common V- and W-cycles.
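Selection based on non-dominated sorting, as mentioned above, ranks candidates into successive Pareto fronts. The sketch below is a simple quadratic-time version of that idea for objectives that are all minimized. It is not the paper's implementation, and the objective pairs (a convergence factor and a runtime per iteration) are invented for illustration.

```python
def dominates(a, b):
    """True if `a` is no worse than `b` in every objective and strictly
    better in at least one (all objectives are minimized)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def non_dominated_sort(points):
    """Split point indices into successive Pareto fronts."""
    remaining = set(range(len(points)))
    fronts = []
    while remaining:
        # A point is in the current front if nothing remaining dominates it.
        front = sorted(
            i for i in remaining
            if not any(dominates(points[j], points[i])
                       for j in remaining if j != i)
        )
        fronts.append(front)
        remaining -= set(front)
    return fronts


# Invented candidates: (convergence factor, runtime per iteration).
solvers = [(0.1, 5.0), (0.3, 2.0), (0.2, 6.0), (0.5, 1.0), (0.4, 3.0)]
fronts = non_dominated_sort(solvers)
print(fronts)  # [[0, 1, 3], [2, 4]]
```

The first front holds the candidates that trade convergence against runtime without being beaten on both; a selection step would prefer those before drawing from later fronts.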