LG · Apr 3, 2023

X-TIME: An in-memory engine for accelerating machine learning on tabular data with CAMs

arXiv:2304.01285v4 · 10 citations · h-index: 11
Originality: Highly original
AI Analysis

This work targets inference latency and energy efficiency for data scientists and researchers who use tree-based models such as XGBoost and CatBoost on tabular data. It presents a novel hardware acceleration approach rather than an incremental improvement.

The paper tackles inference acceleration for tree-based machine learning models on tabular data. These models are often more accurate than deep learning in this domain but lack hardware acceleration. The authors develop an analog-digital architecture with a novel increased-precision analog CAM and a programmable chip, achieving state-of-the-art accuracy and 119x higher throughput at 9740x lower latency, with >150x improved energy efficiency compared to a GPU.

Structured, or tabular, data is the most common format in data science. While deep learning models have proven formidable in learning from unstructured data such as images or speech, they are less accurate than simpler approaches when learning from tabular data. In contrast, modern tree-based Machine Learning (ML) models shine in extracting relevant information from structured data. An essential requirement in data science is to reduce model inference latency in cases where, for example, models are used in a closed loop with simulation to accelerate scientific discovery. However, the hardware acceleration community has mostly focused on deep neural networks and largely ignored other forms of machine learning. Previous work has described the use of an analog content addressable memory (CAM) component for efficiently mapping random forests. In this work, we develop an analog-digital architecture that implements a novel increased precision analog CAM and a programmable chip for inference of state-of-the-art tree-based ML models, such as XGBoost, CatBoost, and others. Thanks to hardware-aware training, X-TIME reaches state-of-the-art accuracy and 119x higher throughput at 9740x lower latency with >150x improved energy efficiency compared with a state-of-the-art GPU for models with up to 4096 trees and depth of 8, with a 19W peak power consumption.
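The abstract does not spell out how a tree is laid out in a CAM, so the following is only a minimal software sketch of the general idea, under stated assumptions. It assumes each root-to-leaf path becomes one CAM row holding a per-feature range, and that a query matches a row when every feature falls inside that row's range. The tree, the `x < threshold` goes-left convention, and the names `tree_to_rows` and `cam_lookup` are hypothetical and are not taken from the paper or its code.

```python
import math

# Hypothetical toy tree. Internal nodes are (feature_index, threshold, left, right);
# leaves are plain floats. Assumed convention: go left when x[feature] < threshold.
TREE = (0, 0.5,
        (1, 0.3, 1.0, 2.0),
        3.0)


def tree_to_rows(node, n_features, bounds=None):
    """Flatten every root-to-leaf path into one row of per-feature [lo, hi) ranges."""
    if bounds is None:
        bounds = [(-math.inf, math.inf)] * n_features
    if not isinstance(node, tuple):
        # Leaf: emit the accumulated ranges together with the leaf value.
        return [(list(bounds), node)]
    feat, thr, left, right = node
    lo, hi = bounds[feat]
    left_bounds = list(bounds)
    left_bounds[feat] = (lo, min(hi, thr))    # left branch tightens the upper bound
    right_bounds = list(bounds)
    right_bounds[feat] = (max(lo, thr), hi)   # right branch tightens the lower bound
    return (tree_to_rows(left, n_features, left_bounds)
            + tree_to_rows(right, n_features, right_bounds))


def cam_lookup(rows, x):
    """Emulate a range-matching search: return leaf values of all rows that match x."""
    return [value for ranges, value in rows
            if all(lo <= xi < hi for (lo, hi), xi in zip(ranges, x))]


def predict_tree(node, x):
    """Reference prediction by ordinary sequential tree traversal."""
    while isinstance(node, tuple):
        feat, thr, left, right = node
        node = left if x[feat] < thr else right
    return node


rows = tree_to_rows(TREE, n_features=2)
```

In this sketch the ranges of a single tree partition the input space, so `cam_lookup(rows, x)` returns exactly one leaf value, and it agrees with `predict_tree(TREE, x)`. For an ensemble, one would concatenate the rows of all trees and sum the matched leaf values. The comparison against every row is written as a loop here; the appeal of a CAM is that all rows are compared in parallel.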

Code Implementations: 1 repo