Fabio Banchelli

2papers

2 Papers

79.8ARApr 14
EPAC: The Last Dance

Filippo Mantovani, Fabio Banchelli, Pablo Vizcaino et al.

This paper presents EPAC, a RISC-V-based accelerator chip developed within the European Processor Initiative (EPI) as part of a multi-year, multi-partner effort to build a European HPC processor ecosystem. EPAC is implemented in GlobalFoundries 22FDX (GF22FDX) technology, covers an area of 27 sq mm with approximately 0.3 billion transistors, and integrates three distinct RISC-V compute tiles targeting different workload classes: VEC, a vector processing tile for double-precision HPC workloads; STX, a many-core tile optimized for stencil and machine learning computations; and VRP, a variable-precision tile for iterative numerical solvers requiring extended floating-point formats. All tiles are connected through a Coherent Hub Interface (CHI) based network-on-chip with a distributed L2 cache system and communicate with external memory via a SerDes link. The chip was taped out in GF22FDX technology and successfully brought up, with all major IP blocks validated. This paper describes the architecture of each tile and the uncore infrastructure, the integration and physical implementation process, and the board-level bring-up activities. It also reflects on the engineering and coordination lessons learned from a full chip design effort distributed across academic and industrial partners in Europe.

26.1PFMay 15
Heuristic-Based Merging of HPC Traces to Extend Hardware Counter Coverage

Júlia Orteu Aubach, Fabio Banchelli, Marc Clascà Ramírez et al.

This work extends a framework for predicting the performance of High-Performance Computing (HPC) workloads using Machine Learning (ML). A common limitation in performance modeling is the restricted number of hardware counters that can be collected simultaneously. To address this, we propose a heuristic-based methodology to merge execution traces from multiple runs, each instrumented with a different set of hardware counters. Our approach matches computation bursts across executions by analyzing MPI structure, timing, and communication patterns. This process enables the construction of a unified dataset that includes a wider set of hardware features without relying on multiplexing. The output is a new synthetic trace with all merged counters, which can be used both for HPC performance prediction and for conventional performance analysis. The methodology has been validated on MareNostrum5 machine with a range of kernels and real applications. Results show that the merged counters maintain acceptable accuracy depending on the application, and can be directly used to train ML models on a richer feature space without prior counter selection.