PL · DB · LG · Jan 10, 2020

Multi-layer Optimizations for End-to-End Data Analytics

arXiv:2001.03541v1 · 19 citations
AI Analysis

This work addresses slow data analytics workflows for data scientists and engineers. It offers a significant speedup, but the contribution is incremental because it builds on existing optimization techniques.

The paper tackles the inefficiency of training machine learning models on multi-relational data. It introduces the IFAQ framework, which integrates feature extraction and learning into a single program and applies multi-layer optimizations to it. For linear regression and regression tree models, the reported performance improvements over existing tools such as mlpack, Scikit, and TensorFlow reach several orders of magnitude.

We consider the problem of training machine learning models over multi-relational data. The mainstream approach is to first construct the training dataset using a feature extraction query over the input database and then use a statistical software package of choice to train the model. In this paper we introduce Iterative Functional Aggregate Queries (IFAQ), a framework that realizes an alternative approach. IFAQ treats the feature extraction query and the learning task as one program given in IFAQ's domain-specific language, which captures a subset of Python commonly used in Jupyter notebooks for rapid prototyping of machine learning applications. The program is subject to several layers of IFAQ optimizations, such as algebraic transformations, loop transformations, schema specialization, data layout optimizations, and finally compilation into efficient low-level C++ code specialized for the given workload and data. We show that a Scala implementation of IFAQ can outperform mlpack, Scikit, and TensorFlow by several orders of magnitude for linear regression and regression tree models over several relational datasets.
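To make the contrast in the abstract concrete, here is a minimal Python sketch of the difference between materializing the feature extraction join and pushing the learning aggregates past it. This is not IFAQ's DSL or its generated code: the relations `sales` and `items`, their contents, and both function names are hypothetical, and the sketch assumes `item` is a unique key in `items`. It only illustrates the kind of algebraic rewrite that becomes possible once the query and the learning task are treated as one program.

```python
from collections import defaultdict

# Hypothetical toy relations (not from the paper):
# Sales(item, units) and Items(item, price), joined on item.
sales = [("a", 2.0), ("a", 3.0), ("b", 1.0), ("b", 4.0), ("c", 5.0)]
items = [("a", 10.0), ("b", 20.0), ("c", 30.0)]


def aggregates_materialized(sales, items):
    """Mainstream approach: build the join first, then aggregate over it."""
    price_of = dict(items)
    joined = [(price_of[item], units) for item, units in sales if item in price_of]
    sum_pp = sum(p * p for p, _ in joined)  # SUM(price * price)
    sum_pu = sum(p * u for p, u in joined)  # SUM(price * units)
    return sum_pp, sum_pu


def aggregates_pushed_down(sales, items):
    """Alternative: pre-aggregate Sales per join key and never build the join."""
    count = defaultdict(int)
    units_sum = defaultdict(float)
    for item, units in sales:
        count[item] += 1
        units_sum[item] += units
    # Each Items row contributes once per matching Sales row.
    sum_pp = sum(p * p * count[item] for item, p in items)
    sum_pu = sum(p * units_sum[item] for item, p in items)
    return sum_pp, sum_pu


if __name__ == "__main__":
    sum_pp, sum_pu = aggregates_pushed_down(sales, items)
    # Least-squares slope for units ~ theta * price (one feature, no intercept).
    theta = sum_pu / sum_pp
    print(sum_pp, sum_pu, theta)
```

Both functions return the same two sums, which are all that a one-feature least-squares fit needs. The second version touches each input tuple once and its intermediate state is bounded by the number of distinct join keys rather than by the size of the join result.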

