Leonel Sousa

DC
4 papers
23 citations
Novelty: 49%
AI Score: 46

4 Papers

DC, May 28
PRISM: Processing-In-Memory Sparse MTTKRP for Tensor Decomposition Acceleration

Daniel Pacheco, Leonel Sousa, Aleksandar Ilic

Sparse tensors are the most used representation of sparse multidimensional data. Operations that decompose them, selecting their most important features while reducing their dimension, have become prevalent procedures in machine learning. One of the most used tensor decomposition algorithms is the Alternating Least Squares Canonical Polyadic Decomposition (CP-ALS), where the most time-consuming operation is the Sparse Matricized Tensor Times Khatri-Rao Product (spMTTKRP). This operation is strongly memory-bound, making it hard to implement efficiently on general-purpose processors. This work proposes PRISM, the first approach to tackle this operation using Processing-In-Memory (PIM) technology. We extensively characterize different partitioning strategies, number formats, and kernel optimizations that efficiently adapt this operation to UPMEM PIM, which is further boosted by heterogeneous collaboration with the CPU. The experimental results show that the proposed PIM-based and heterogeneous approaches achieve up to 2.37x and 2.64x speedup compared to state-of-the-art CPU implementations, respectively. However, the UPMEM distributed memory system can significantly hinder performance on certain workloads. Nonetheless, the efficiency of resource consumption for this approach, measured by peak performance fraction usage, is significantly higher than for both CPU and GPU.
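As a rough illustration of the spMTTKRP kernel named in the abstract above, the following sketch computes the mode-0 product for a 3-way sparse tensor stored as coordinate (COO) triples. It is a plain sequential Python reference, not the PRISM implementation; the function name `sp_mttkrp_mode0`, the COO layout, and the toy data are assumptions made for illustration only.

```python
def sp_mttkrp_mode0(coords, values, B, C, num_rows):
    """Mode-0 MTTKRP for a 3-way sparse tensor in COO format.

    For every nonzero X[i, j, k] = v, accumulate
    M[i, r] += v * B[j, r] * C[k, r] for each rank component r.
    """
    rank = len(B[0])
    M = [[0.0] * rank for _ in range(num_rows)]
    for (i, j, k), v in zip(coords, values):
        row_b, row_c, out = B[j], C[k], M[i]
        for r in range(rank):
            out[r] += v * row_b[r] * row_c[r]
    return M


# Hypothetical toy data: three nonzeros of a 2 x 2 x 2 tensor, rank 2.
coords = [(0, 0, 0), (0, 1, 1), (1, 1, 0)]
values = [2.0, 3.0, 4.0]
B = [[1.0, 2.0], [3.0, 4.0]]
C = [[5.0, 6.0], [7.0, 8.0]]

M = sp_mttkrp_mode0(coords, values, B, C, num_rows=2)
print(M)
```

Each nonzero performs only a few multiply-adds but reads one row from each factor matrix through indirect indices, which gives some intuition for why the abstract describes the operation as strongly memory-bound.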

AR, May 28
Constant Depth Threshold Circuits For Exhaustive Epistasis Detection

André Ribeiro, Aleksandar Ilic, Leonel Sousa

The development of large-scale neuromorphic hardware has made practical implementations of threshold gate-based circuits a near-term possibility. The complexity advantages over traditional computing classes, as evidenced in the literature, have prompted us to tackle Epistasis Detection, one of the most computationally complex combinatorial problems in bioinformatics. We propose specially designed circuits that calculate the relative frequencies of all dataset combinations in an efficient pipelined fashion, taking advantage of co-located memory and configurable parallelism, obtaining complexity gains. Overall, the runtime is bounded by the number of combinations to calculate, without any additional complexity overhead, contrary to classical approaches, using log-linear space. To accomplish this, we propose a data encoding and combination generation strategy using Leaky Integrate and Fire (LIF) neurons that feeds a constant depth threshold gate population count circuit. Accounting for typical hardware characteristics, such as limited fan-in and variable precisions, we obtain logarithmic depth and log-cubic linear connections for the population count circuit by composing the developed unbounded fan-in constant depth threshold gate circuits that perform population count and binary array sum.
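To make the idea of a constant depth threshold gate population count more concrete, here is a small software simulation of one textbook-style construction with unbounded fan-in and three layers. It is only a sketch under that assumption and is not the circuit proposed in the paper, whose details are not given in the abstract; the helper names `threshold_gate`, `popcount_threshold`, and `bits_to_int` are hypothetical.

```python
from itertools import product


def threshold_gate(inputs, weights, threshold):
    """Fire (return 1) when the weighted input sum reaches the threshold."""
    return 1 if sum(w * x for w, x in zip(weights, inputs)) >= threshold else 0


def popcount_threshold(bits):
    """Population count with three layers of threshold gates.

    Returns the count as a list of bits, least-significant bit first.
    """
    n = len(bits)
    # Layer 1: t[k - 1] fires when at least k inputs are set (k = 1..n).
    t = [threshold_gate(bits, [1] * n, k) for k in range(1, n + 1)]
    t.append(0)  # sentinel: "at least n + 1 inputs set" never fires
    # Layer 2: e[k - 1] fires when exactly k inputs are set.
    e = [threshold_gate([t[k - 1], t[k]], [1, -1], 1) for k in range(1, n + 1)]
    # Layer 3: output bit j is the OR of the exact-k gates whose k has bit j set.
    out = []
    for j in range(n.bit_length()):
        selected = [e[k - 1] for k in range(1, n + 1) if (k >> j) & 1]
        out.append(threshold_gate(selected, [1] * len(selected), 1))
    return out


def bits_to_int(bits_lsb_first):
    """Convert a least-significant-bit-first list into an integer."""
    return sum(b << j for j, b in enumerate(bits_lsb_first))


# Exhaustive check on all 6-bit inputs.
all_ok = all(
    bits_to_int(popcount_threshold(list(x))) == sum(x)
    for x in product([0, 1], repeat=6)
)
print(all_ok)
```

The depth stays fixed at three layers regardless of the input length, but the first layer needs fan-in equal to the number of inputs, which is the kind of assumption that limited fan-in hardware would not satisfy directly.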

DC, May 28
CARM Tool: Cache-Aware Roofline Model Automatic Benchmarking and Application Analysis

José Morgado, Leonel Sousa, Aleksandar Ilic

In recent years, HPC systems, and CPU architectures as their central components, have become increasingly complex, making application development and optimization quite challenging. In this respect, intuitive performance models like the Cache-aware Roofline Model (CARM) offer effective guidance by providing insights into bottlenecks that limit the application's ability to reach the system's maximum performance. To fully exploit the benefits of CARM optimization guidance for application development, automatic tools for cross-architecture model construction and in-depth application characterization are absolutely essential. Given the plethora of existing CPU architectures, the current landscape of CARM-enabled tools consists of tools that are either vendor-specific (Intel Advisor), insufficiently developed (ARM), or simply non-existent (AMD, RISC-V). This is the gap that this work intends to close by bringing automatic CARM support to all major CPU architectures and ISAs, i.e., x86 (Intel, AMD), ARM, and RISC-V, by developing assembly microbenchmarks specifically tailored to cover the full performance spectrum of modern CPUs (from scalar to all supported vector ISA extensions) for both computational units and all memory hierarchy levels. Additionally, this work integrates application analysis within the CARM framework using performance counters and dynamic binary instrumentation. Experimental results show that the CARM roofs constructed with the proposed automated framework provide less than a 1% deviation across various tested architectural maximums.
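The roofline bound behind such models can be sketched in a few lines: attainable performance is the smaller of the compute peak and the product of arithmetic intensity and memory bandwidth, with one bandwidth roof per memory level. The sketch below is not the CARM Tool's code or API; the function names and all peak and bandwidth figures are hypothetical placeholders, not measurements.

```python
def attainable_gflops(arithmetic_intensity, peak_gflops, bandwidth_gbs):
    """Roofline bound: min(compute peak, arithmetic intensity * bandwidth).

    arithmetic_intensity is in FLOP per byte, bandwidth in GB/s.
    """
    return min(peak_gflops, arithmetic_intensity * bandwidth_gbs)


# Hypothetical machine: placeholder values, not measured on any real CPU.
PEAK_GFLOPS = 100.0
BANDWIDTH_GBS = {"L1": 400.0, "L2": 200.0, "L3": 100.0, "DRAM": 25.0}


def carm_roofs(arithmetic_intensity):
    """Attainable performance under each memory-level roof."""
    return {
        level: attainable_gflops(arithmetic_intensity, PEAK_GFLOPS, bw)
        for level, bw in BANDWIDTH_GBS.items()
    }


roofs = carm_roofs(0.5)
print(roofs)
```

In a tool of the kind the abstract describes, the peak and bandwidth values would come from microbenchmarks run on the target CPU rather than from hard-coded constants.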

SP, Apr 11, 2018
Beamformed Fingerprint Learning for Accurate Millimeter Wave Positioning

João Gante, Gabriel Falcão, Leonel Sousa

With millimeter wave wireless communications, the resulting radiation reflects off most visible objects, creating rich multipath environments, particularly in urban scenarios. The radiation captured by a listening device is thus shaped by the obstacles encountered, which carry latent information regarding their relative positions. In this paper, a system to convert the received millimeter wave radiation into the device's position is proposed, making use of the aforementioned hidden information. Using deep learning techniques and a pre-established codebook of beamforming patterns transmitted by a base station, the simulations show that average estimation errors below 10 meters are achievable in realistic outdoor scenarios that contain mostly non-line-of-sight positions, paving the way for new positioning systems.