11.6ARApr 22
Enabling Mixed criticality applications for the Versal AI-EnginesVincent Sprave, Martin Wilhelm, Daniele Passaretti et al.
Adaptive Systems-on-Chips (SoCs) are increasingly being used in mixed criticality systems (MCSs), such as in autonomous driving, aviation and medical systems. In this context, AMD has proposed the Versal SoC, which has a heterogeneous architecture including, among other components, an Artificial Intelligence Engine (AIE), which is a 2D array of processors and memory tiles designed for AI and signal processing workloads. While this AIE offers significant potential for accelerating real-time data processing tasks, this has not yet been explored in the context of MCSs since individual tasks with different criticality levels cannot be dynamically assigned to tiles due to the static mapping of dataflow graphs and tasks. In this work, we propose a dynamic task dispatching infrastructure that enables task switching on the AIE at runtime. Based on this infrastructure, we present an MCS design that dynamically assigns tasks of different criticality to a pool of AIE tiles, depending on the criticality mode of the system. Our approach overcomes the limitations of static dataflow graph mappings and, for the first time, exploits the parallel processing capabilities of the AIE for MCSs. We also present a comprehensive timing analysis of the overhead introduced by the task dispatcher infrastructure, focusing on control logic, context switching and data copy operations. This shows that these operations have low variance and are negligible compared to the overall execution time, demonstrating that our infrastructure is suitable for MCSs. Finally, we evaluate the proposed infrastructure using an autonomous driving workload with tasks that have variable execution times and different criticality levels. In this case study, we maximized AIE utilization, reducing idle time by 65.5 %, while measuring an execution time overhead of less than 0.002 %, and doubling the throughput of low-criticality tasks.
DCAug 12, 2022
Modeling Task Mapping for Data-intensive Applications in Heterogeneous SystemsMartin Wilhelm, Hanna Geppert, Anna Drewes et al.
We introduce a new model for the task mapping problem to aid in the systematic design of algorithms for heterogeneous systems including, but not limited to, CPUs, GPUs and FPGAs. A special focus is set on the communication between the devices, its influence on parallel execution, as well as on device-specific differences regarding parallelizability and streamability. We show how this model can be utilized in different system design phases and present two novel mixed-integer linear programs to demonstrate the usage of the model.
41.5ARApr 22
Efficient Batch Search Algorithm for B+ Tree Index Structures with Level-Wise Traversal on FPGAsMax Tzschoppe, Martin Wilhelm, Sven Groppe et al.
This paper introduces a search algorithm for index structures based on a B+ tree, specifically optimized for execution on a field-programmable gate array (FPGA). Our implementation efficiently traverses and reuses tree nodes by processing a batch of search keys level by level. This approach reduces costly global memory accesses, improves reuse of loaded B+ tree nodes, and enables parallel search key comparisons directly on the FPGA. Using a high-level synthesis (HLS) approach, we developed a highly flexible and configurable search kernel design supporting variable batch sizes, customizable node sizes, and arbitrary tree depths. The final design was implemented on an AMD Alveo U250 Data Center Accelerator Card, and was evaluated against the B+ tree search algorithm from the TLX library running on an AMD EPYC 7542 processor (2.9 GHz). With a batch size of 1000 search keys, a B+ tree containing one million entries, and a tree order of 16, we measured a 4.9x speedup for the single-kernel FPGA design compared to a single-threaded CPU implementation. Running four kernel instances in parallel on the FPGA resulted in a 2.1$\times$ performance improvement over a CPU implementation using 16 threads.
DCFeb 27, 2025
Static task mapping for heterogeneous systems based on series-parallel decompositionsMartin Wilhelm, Thilo Pionteck
Modern heterogeneous systems consist of many different processing units, such as CPUs, GPUs, FPGAs and AI units. A central problem in the design of applications in this environment is to find a beneficial mapping of tasks to processing units. While there are various approaches to task mapping, few can deal with high heterogeneity or applications with a high number of tasks and many dependencies. In addition, streaming aspects of FPGAs are generally not considered. We present a new general task mapping principle based on graph decompositions and model-based evaluation that can find beneficial mappings regardless of the complexity of the scenario. We apply this principle to create a high-quality and reasonably efficient task mapping algorithm using series-parallel decompositions. For this, we present a new algorithm to compute a forest of series-parallel decomposition trees for general DAGs. We compare our decomposition-based mapping algorithm with three mixed-integer linear programs, one genetic algorithm and two variations of the Heterogeneous Earliest Finish Time (HEFT) algorithm. We show that our approach can generate mappings that lead to substantially higher makespan improvements than the HEFT variations in complex environments while being orders of magnitude faster than a mapper based on genetic algorithms or integer linear programs.
DCOct 8, 2025
Evaluating Rapid Makespan Predictions for Heterogeneous Systems with Programmable LogicMartin Wilhelm, Franz Freitag, Max Tzschoppe et al.
Heterogeneous computing systems, which combine general-purpose processors with specialized accelerators, are increasingly important for optimizing the performance of modern applications. A central challenge is to decide which parts of an application should be executed on which accelerator or, more generally, how to map the tasks of an application to available devices. Predicting the impact of a change in a task mapping on the overall makespan is non-trivial. While there are very capable simulators, these generally require a full implementation of the tasks in question, which is particularly time-intensive for programmable logic. A promising alternative is to use a purely analytical function, which allows for very fast predictions, but abstracts significantly from reality. Bridging the gap between theory and practice poses a significant challenge to algorithm developers. This paper aims to aid in the development of rapid makespan prediction algorithms by providing a highly flexible evaluation framework for heterogeneous systems consisting of CPUs, GPUs and FPGAs, which is capable of collecting real-world makespan results based on abstract task graph descriptions. We analyze to what extent actual makespans can be predicted by existing analytical approaches. Furthermore, we present common challenges that arise from high-level characteristics such as data transfer overhead and device congestion in heterogeneous systems.