DCMay 28
PRISM: Processing-In-Memory Sparse MTTKRP for Tensor Decomposition AccelerationDaniel Pacheco, Leonel Sousa, Aleksandar Ilic
Sparse tensors are the most used representation of sparse multidimensional data. Operations that decompose them, selecting their most important features while reducing their dimension, have become prevalent procedures in machine learning. One of the most used tensor decomposition algorithms is the Alternating Least Squares Canonical Polyadic Decomposition (CP-ALS), where the most time-consuming operation is the Sparse Matricized Tensor Times Khatri-Rao Product (spMTTKRP). This operation is strongly memory-bound, making it hard to implement efficiently on general-purpose processors. This work proposes PRISM, the first approach to tackle this operation using Processing-In-Memory (PIM) technology. We extensively characterize different partitioning strategies, number formats, and kernel optimizations that efficiently adapt this operation to UPMEM PIM, which is further boosted by heterogeneous collaboration with the CPU. The experimental results show that the proposed PIM-based and heterogeneous approaches achieve up to 2.37x and 2.64x speedup compared to state-of-the-art CPU implementations, respectively. However, the UPMEM distributed memory system can significantly hinder performance on certain workloads. Nonetheless, the efficiency of resource consumption for this approach, measured by peak performance fraction usage, is significantly higher than for both CPU and GPU.
ARMay 28
Constant Depth Threshold Circuits For Exhaustive Epistasis DetectionAndré Ribeiro, Aleksandar Ilic, Leonel Sousa
The development of large-scale neuromorphic hardware has made practical implementations of threshold gate-based circuits a near-term possibility. The complexity advantages regarding traditional computing classes, as evidenced in the literature, have prompted us to tackle Epistasis Detection, one of the most computationally complex combinatorial problems in bioinformatics. We propose specially designed circuits that calculate the relative frequencies of all dataset combinations in an efficient pipelined fashion, taking advantage of co-located memory and configurable parallelism, obtaining complexity gains. Overall, we obtain the runtime to be bounded by the number of combinations to calculate, without any additional complexity overhead, contrary to classical approaches, using log-linear space. To accomplish this, we propose a data encoding and combination generation strategy using Leaky Integrate and Fire (LIF) neurons, that feeds a constant depth threshold gate population count circuit. Accounting for typical hardware characteristics, such as limited fan-in and variable precisions, we obtain logarithmic depth and log-cubic linear connections, for the population count circuit by composing developed unbounded fan-in constant depth threshold gate circuits to perform population count and binary array sum.
DCMay 28
CARM Tool: Cache-Aware Roofline Model Automatic Benchmarking and Application AnalysisJosé Morgado, Leonel Sousa, Aleksandar Ilic
In recent years, HPC systems and CPU architectures as their central components, have become increasingly complex, making application development and optimization quite challenging. In this respect, intuitive performance models like the Cache-aware Roofline Model (CARM) offer effective guidance by providing insights into bottlenecks that limit the application's ability to reach the system's maximum performance. To fully exploit the benefits of CARM optimization guidance for application development, automatic tools for cross-architecture model construction and in-depth application characterization are absolutely essential. Given a plethora of existing CPU architectures, the current landscape of CARM-enabled tools covers either vendor-specific (Intel Advisor), not sufficiently developed (ARM) or simply non-existing (AMD, RISC-V) tools. This is a particular gap that this work intends to close by bringing automatic CARM support to all major CPU architectures and ISAs, i.e., x86 (Intel, AMD), ARM, and RISC-V, by developing assembly microbenchmarks specifically tailored to cover a full performance spectrum of modern CPUs (from scalar to all supported vector ISA extensions) for both computational units and all memory hierarchy levels. Additionally, this work integrates application analysis within the CARM framework using performance counters and dynamic binary instrumentation. Experimental results show that the CARM roofs constructed with the proposed automated framework provide less than a 1% deviation across various tested architectural maximums.
DCMay 27
TrioSeq: A Novel Approach to Accelerate Triplet Sequence Alignment on GPUsMiguel Graça, Aleksandar Ilic
State-of-the-art multiple sequence alignment (MSA) algorithms are based on progressive approaches that rely on pairwise sequence alignment (PSA) to generate guide trees to align all sequences. Given an evidenced explosion in genomic data availability, research efforts have focused on accelerating PSA on massively-parallel architectures (e.g., GPUs) and specialized hardware (e.g., FPGAs). However, there is increasing evidence that starting from exact 3-way alignments could provide more robust, accurate MSAs, and improve genomic analysis. While the current literature has shown that PSA algorithms can be extended to align sequence triplets, the existent state-of-the-art on hardware acceleration of exact 3-way alignments is still scarce. In particular, current GPU methods are still inefficient due to lacking support for novel hardware features (e.g., cross-thread intrinsics), while being closed-source and vendor-specific. In this paper, TrioSeq is proposed as a fine-grained strategy to efficiently implement 3-way alignments on GPUs, leveraging novel levels of GPU parallelism and synchronization to achieve high throughput in aligning sequence triplets. Evaluation on NVIDIA and AMD GPUs shows that TrioSeq outperforms state-of-the-art GPU progressive methods on 3-way alignment by at least 20% on simulated genomic datasets.