Hèctor Martínez

h-index10

4papers

48citations

Novelty35%

AI Score36

Ranked #97,807 of 194,257 authors (top 50%)#18,295 in CL (top 59%)

4 Papers

0.5CLOct 31, 2023Code

Automatic Generators for a Family of Matrix Multiplication Routines with Apache TVM

Guillermo Alaejos, Adrián Castelló, Pedro Alonso-Jordá et al.

We explore the utilization of the Apache TVM open source framework to automatically generate a family of algorithms that follow the approach taken by popular linear algebra libraries, such as GotoBLAS2, BLIS and OpenBLAS, in order to obtain high-performance blocked formulations of the general matrix multiplication (GEMM). % In addition, we fully automatize the generation process, by also leveraging the Apache TVM framework to derive a complete variety of the processor-specific micro-kernels for GEMM. This is in contrast with the convention in high performance libraries, which hand-encode a single micro-kernel per architecture using Assembly code. % In global, the combination of our TVM-generated blocked algorithms and micro-kernels for GEMM 1)~improves portability, maintainability and, globally, streamlines the software life cycle; 2)~provides high flexibility to easily tailor and optimize the solution to different data types, processor architectures, and matrix operand shapes, yielding performance on a par (or even superior for specific matrix shapes) with that of hand-tuned libraries; and 3)~features a small memory footprint.

3.3MSOct 26, 2023Code

Tackling the Matrix Multiplication Micro-kernel Generation with Exo

Adrián Castelló, Julian Bellavita, Grace Dinh et al.

The optimization of the matrix multiplication (or GEMM) has been a need during the last decades. This operation is considered the flagship of current linear algebra libraries such as BLIS, OpenBLAS, or Intel OneAPI because of its widespread use in a large variety of scientific applications. The GEMM is usually implemented following the GotoBLAS philosophy, which tiles the GEMM operands and uses a series of nested loops for performance improvement. These approaches extract the maximum computational power of the architectures through small pieces of hardware-oriented, high-performance code called micro-kernel. However, this approach forces developers to generate, with a non-negligible effort, a dedicated micro-kernel for each new hardware. In this work, we present a step-by-step procedure for generating micro-kernels with the Exo compiler that performs close to (or even better than) manually developed microkernels written with intrinsic functions or assembly language. Our solution also improves the portability of the generated code, since a hardware target is fully specified by a concise library-based description of its instructions.

6.1MLJun 4, 2015Code

The Preference Learning Toolbox

Vincent E. Farrugia, Héctor P. Martínez, Georgios N. Yannakakis

Preference learning (PL) is a core area of machine learning that handles datasets with ordinal relations. As the number of generated data of ordinal nature is increasing, the importance and role of the PL field becomes central within machine learning research and practice. This paper introduces an open source, scalable, efficient and accessible preference learning toolbox that supports the key phases of the data training process incorporating various popular data preprocessing, feature selection and preference learning methods.

4.9CLJun 13, 2025

The Cambrian Explosion of Mixed-Precision Matrix Multiplication for Quantized Deep Learning Inference

Héctor Martínez, Adrián Castelló, Francisco D. Igual et al.

Recent advances in deep learning (DL) have led to a shift from traditional 64-bit floating point (FP64) computations toward reduced-precision formats, such as FP16, BF16, and 8- or 16-bit integers, combined with mixed-precision arithmetic. This transition enhances computational throughput, reduces memory and bandwidth usage, and improves energy efficiency, offering significant advantages for resource-constrained edge devices. To support this shift, hardware architectures have evolved accordingly, now including adapted ISAs (Instruction Set Architectures) that expose mixed-precision vector units and matrix engines tailored for DL workloads. At the heart of many DL and scientific computing tasks is the general matrix-matrix multiplication gemm, a fundamental kernel historically optimized using axpy vector instructions on SIMD (single instruction, multiple data) units. However, as hardware moves toward mixed-precision dot-product-centric operations optimized for quantized inference, these legacy approaches are being phased out. In response to this, our paper revisits traditional high-performance gemm and describes strategies for adapting it to mixed-precision integer (MIP) arithmetic across modern ISAs, including x86_64, ARM, and RISC-V. Concretely, we illustrate novel micro-kernel designs and data layouts that better exploit today's specialized hardware and demonstrate significant performance gains from MIP arithmetic over floating-point implementations across three representative CPU architectures. These contributions highlight a new era of gemm optimization-driven by the demands of DL inference on heterogeneous architectures, marking what we term as the "Cambrian period" for matrix multiplication.