DC · CY · PF · SE · Oct 13, 2020

Data Engineering for HPC with Python

arXiv:2010.06312v1 · 14 citations
Originality: Synthesis-oriented
AI Analysis

This work addresses data engineering bottlenecks faced by scientific researchers on HPC systems. Its contribution is incremental: it builds on existing table-based approaches and adds performance optimizations.

The paper tackles data engineering for scientific machine learning by presenting a distributed Python API built on a table abstraction. The API uses C++ compute kernels and MPI for high-performance processing on HPC clusters; the abstract does not quantify the resulting performance gains.

Data engineering is becoming an increasingly important part of scientific discoveries with the adoption of deep learning and machine learning. Data engineering deals with a variety of data formats, storage, data extraction, transformation, and data movements. One goal of data engineering is to transform data from its original form into the vector/matrix/tensor formats accepted by deep learning and machine learning applications. There are many structures such as tables, graphs, and trees to represent data in these data engineering phases. Among them, tables are a versatile and commonly used format to load and process data. In this paper, we present a distributed Python API based on table abstraction for representing and processing data. Unlike existing state-of-the-art data engineering tools written purely in Python, our solution adopts high performance compute kernels in C++ and an in-memory table representation, exposed through Cython-based Python bindings. In the core system, we use MPI for distributed memory computations with a data-parallel approach for processing large datasets in HPC clusters.
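The data-parallel table processing described in the abstract can be illustrated with a minimal sketch. This is not the paper's API: every function name here (`hash_partition`, `local_join`, `distributed_join`, `to_matrix`) is hypothetical, the "ranks" are simulated as entries in a Python list rather than MPI processes, and the kernels are plain Python rather than C++. The sketch only shows the general idea of hash-partitioning two tables by key so that each worker can join its own partition independently, and then converting the joined table to a matrix-like structure.

```python
from collections import defaultdict


def hash_partition(rows, key, world_size):
    """Assign each row to a simulated rank by hashing its key column."""
    parts = [[] for _ in range(world_size)]
    for row in rows:
        parts[hash(row[key]) % world_size].append(row)
    return parts


def local_join(left, right, key):
    """Inner hash join of two lists of dict rows on `key`."""
    index = defaultdict(list)
    for r in right:
        index[r[key]].append(r)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


def distributed_join(left, right, key, world_size=4):
    """Simulated data-parallel join (assumption: no real MPI is used).

    Rows with equal keys land on the same simulated rank, so each rank
    can join its partition without seeing the others.
    """
    left_parts = hash_partition(left, key, world_size)
    right_parts = hash_partition(right, key, world_size)
    out = []
    for rank in range(world_size):  # each iteration stands in for one process
        out.extend(local_join(left_parts[rank], right_parts[rank], key))
    return out


def to_matrix(rows, columns):
    """Turn table rows into a list-of-lists matrix of floats."""
    return [[float(row[c]) for c in columns] for row in rows]


# Hypothetical example tables: features and labels keyed by sample id.
samples = [{"id": 1, "x": 0.5}, {"id": 2, "x": 1.5}, {"id": 3, "x": 2.5}]
labels = [{"id": 1, "y": 0}, {"id": 3, "y": 1}, {"id": 4, "y": 1}]

joined = sorted(distributed_join(samples, labels, "id"), key=lambda r: r["id"])
matrix = to_matrix(joined, ["x", "y"])
```

In a real MPI setting the partition step would become an all-to-all exchange between processes, and the local join would run in a compiled kernel; the sketch keeps both in one process purely for readability.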

