Andreas Herten

9papers

62citations

Novelty32%

AI Score49

Ranked #44,975 of 205,806 authors (top 22%)#185 in DC (top 16%)

9 Papers

36.8QUANT-PHMay 20

Universal Quantum Computer Simulation of 50 Qubits on Europe`s First Exascale Supercomputer Harnessing Its Heterogeneous CPU-GPU Architecture

Hans De Raedt, Jiri Kraus, Andreas Herten et al.

We have developed a new version of the high-performance Jülich universal quantum computer simulator (JUQCS-50) that leverages key features of the GH200 superchips as used in the JUPITER supercomputer, enabling simulations of a 50-qubit universal quantum computer for the first time. JUQCS-50 achieves this through three key innovations: (1) extending usable memory beyond GPU limits via high-bandwidth CPU-GPU interconnects and LPDDR5 memory; (2) adaptive data encoding to reduce memory footprint with acceptable trade-offs in precision and compute effort; and (3) an on-the-fly network traffic optimizer. These advances result in a 16.6-fold speedup over the previous 48-qubit record on the K computer

35.1DCMar 25Code

Efficient Accelerated Graph Edit Distance Computation on GPU

Adel Dabah, Andreas Herten

Graph representation is a powerful abstraction of real-world objects and relations. Computing the Graph Edit Distance (GED) between graphs is critical in domains such as bioinformatics, machine learning, and pattern recognition. GED measures the minimum number of edit operations required to transform one graph into another. However, the high computational complexity of optimal and near-optimal methods limits their applicability to large-scale graphs, making high-performance parallel GED computation essential. To address this, we propose FAST-GED, a fast and scalable open-source framework for GED computation on GPUs. FAST-GED overcomes existing limitations by combining high accuracy with fast execution through GPU-friendly algorithmic design and efficient mapping to GPU hardware, minimizing host-device communication. The implementation is optimized and tested across multiple GPU architectures. We validate FAST-GED on real and synthetic datasets with diverse graph sizes and densities. It achieves speedups of several orders of magnitude over the Python NetworkX library while reaching optimal solutions in most cases. Moreover, it outperforms state-of-the-art approximate methods in both accuracy and scalability. We show that FAST-GED enables broader adoption of GED-based solutions in real-world applications.

CLSep 30, 2024

Teuken-7B-Base & Teuken-7B-Instruct: Towards European LLMs

Mehdi Ali, Michael Fromm, Klaudia Thellmann et al.

We present two multilingual LLMs, Teuken 7B-base and Teuken 7B-instruct, designed to embrace Europe's linguistic diversity by supporting all 24 official languages of the European Union. Trained on a dataset comprising around 60% non-English data and utilizing a custom multilingual tokenizer, our models address the limitations of existing LLMs that predominantly focus on English or a few high-resource languages. We detail the models' development principles, i.e., data composition, tokenizer optimization, and training methodologies. The models demonstrate strong performance across multilingual benchmarks, as evidenced by their performance on European versions of ARC, HellaSwag, and TruthfulQA.

32.9DCMar 24

Scaled Block Vecchia Approximation for High-Dimensional Gaussian Process Emulation on GPUs

Qilong Pan, Sameh Abdulah, Mustafa Abduljabbar et al.

Emulating computationally intensive scientific simulations is crucial for enabling uncertainty quantification, optimization, and informed decision-making at scale. Gaussian Processes (GPs) offer a flexible and data-efficient foundation for statistical emulation, but their poor scalability limits applicability to large datasets. We introduce the Scaled Block Vecchia (SBV) algorithm for distributed GPU-based systems. SBV integrates the Scaled Vecchia approach for anisotropic input scaling with the Block Vecchia (BV) method to reduce computational and memory complexity while leveraging GPU acceleration techniques for efficient linear algebra operations. To the best of our knowledge, this is the first distributed implementation of any Vecchia-based GP variant. Our implementation employs MPI for inter-node parallelism and the MAGMA library for GPU-accelerated batched matrix computations. We demonstrate the scalability and efficiency of the proposed algorithm through experiments on synthetic and real-world workloads, including a 50M point simulation from a respiratory disease model. SBV achieves near-linear scalability on up to 512 A100 and GH200 GPUs, handles 2.56B points, and reduces energy use relative to exact GP solvers, establishing SBV as a scalable and energy-efficient framework for emulating large-scale scientific models on GPU-based distributed systems.

ARSep 19, 2024

Performance and Power: Systematic Evaluation of AI Workloads on Accelerators with CARAML

Chelsea Maria John, Stepan Nassyr, Carolin Penke et al.

The rapid advancement of machine learning (ML) technologies has driven the development of specialized hardware accelerators designed to facilitate more efficient model training. This paper introduces the CARAML benchmark suite, which is employed to assess performance and energy consumption during the training of transformer-based large language models and computer vision models on a range of hardware accelerators, including systems from NVIDIA, AMD, and Graphcore. CARAML provides a compact, automated, extensible, and reproducible framework for assessing the performance and energy of ML workloads across various novel hardware architectures. The design and implementation of CARAML, along with a custom power measurement tool called jpwr, are discussed in detail.

36.6DCMay 7

Training LLMs on HPC Systems: Best Practices from the OpenGPT-X Project

Carolin Penke, Chelsea Maria John, Jan Ebert et al.

The training of large language models (LLMs) requires substantial computational resources, complex software stacks, and carefully designed workflows to achieve scalability and efficiency. This report presents best practices and insights gained from the OpenGPT-X project, a German initiative focused on developing open, multilingual LLMs optimized for European languages. We detail the use of high-performance computing (HPC) systems, primarily JUWELS Booster at JSC, for training Teuken-7B, a 7-billion-parameter transformer model. The report covers system architecture, training infrastructure, software choices, profiling and benchmarking tools, as well as engineering and operational challenges. It includes measured throughput data of various configurations of 3D parallelism during training and the impact of features such as flash attention.

46.3DCMar 23

exaCB: Reproducible Continuous Benchmark Collections at Scale Leveraging an Incremental Approach

Jayesh Badwaik, Mathis Bode, Michal Rajski et al.

The increasing heterogeneity of high-performance computing (HPC) systems and the transition to exascale architectures require systematic and reproducible performance evaluation across diverse workloads. While continuous integration (CI) ensures functional correctness in software engineering, performance and energy efficiency in HPC are typically evaluated outside CI workflows, motivating continuous benchmarking (CB) as a complementary approach. Integrating benchmarking into CI workflows enables reproducible evaluation, early detection of regressions, and continuous validation throughout the software development lifecycle. We present exaCB, a framework for continuous benchmarking developed in the context of the JUPITER exascale system. exaCB enables application teams to integrate benchmarking into their workflows while supporting large-scale, system-wide studies through reusable CI/CD components, established harnesses, and a shared reporting protocol. The framework supports incremental adoption, allowing benchmarks to be onboarded easily and to evolve from basic runnability to more advanced instrumentation and reproducibility. The approach is demonstrated in JUREAP, the early-access program for JUPITER, where exaCB enabled continuous benchmarking of over 70 applications at varying maturity levels, supporting cross-application analysis, performance tracking, and energy-aware studies. These results illustrate the practicality using exaCB for continuous benchmarking for exascale HPC systems across large, diverse collections of scientific applications.

63.2DCMar 12

High-performance Vector-length Agnostic Quantum Circuit Simulations on ARM Processors

Ruimin Shi, Gabin Schieffer, Pei-Hung Lin et al.

ARM SVE and RISC-V RVV are emerging vector architectures in high-end processors that support vectorization of flexible vector length. In this work, we leverage an important workload for quantum computing, quantum state-vector simulations, to understand whether high-performance portability can be achieved in a vector-length agnostic (VLA) design. We propose a VLA design and optimization techniques critical for achieving high performance, including VLEN-adaptive memory layout adjustment, load buffering, fine-grained loop control, and gate fusion-based arithmetic intensity adaptation. We provide an implementation in Google's Qsim and evaluate five quantum circuits of up to 36 qubits on three ARM processors, including NVIDIA Grace, AWS Graviton3, and Fujitsu A64FX. By defining new metrics and PMU events to quantify vectorization activities, we draw generic insights for future VLA designs. Our single-source implementation of VLA quantum simulations achieves up to 4.5x speedup on A64FX, 2.5x speedup on Grace, and 1.5x speedup on Graviton.

DCJun 30, 2021

JUWELS Booster -- A Supercomputer for Large-Scale AI Research

Stefan Kesselheim, Andreas Herten, Kai Krajsek et al.

In this article, we present JUWELS Booster, a recently commissioned high-performance computing system at the Jülich Supercomputing Center. With its system architecture, most importantly its large number of powerful Graphics Processing Units (GPUs) and its fast interconnect via InfiniBand, it is an ideal machine for large-scale Artificial Intelligence (AI) research and applications. We detail its system architecture, parallel, distributed model training, and benchmarks indicating its outstanding performance. We exemplify its potential for research application by presenting large-scale AI research highlights from various scientific fields that require such a facility.