DB · LG · Apr 7, 2021

Efficient and Accurate In-Database Machine Learning with SQL Code Generation in Python

arXiv:2104.03224v2 · 4 citations
AI Analysis

This work addresses efficient machine learning for big data and larger-than-memory datasets by enabling in-database processing, though it appears incremental in improving existing SQL-based ML approaches.

The paper tackles in-database machine learning by developing a SQL code generation method in Python with a novel discretization approach, achieving multidimensional probability estimation that was the fastest among tested algorithms while being only 1-2% less accurate than top methods like decision trees.

Following an analysis of the advantages of SQL-based Machine Learning (ML) and a short literature survey of the field, we describe a novel method for In-Database Machine Learning (IDBML). We contribute a process for SQL code generation in Python using template macros in Jinja2, as well as a prototype implementation of the process. We describe our implementation of the process to compute multidimensional histogram (MDH) probability estimation in SQL. For this, we contribute and implement a novel discretization method called equal quantized rank binning (EQRB), as well as equal-width binning (EWB). Based on this, we provide data gathered in a benchmarking experiment for the quantitative empirical evaluation of our method and system using the Covertype dataset. We measured accuracy and computation time and compared them with state-of-the-art classification algorithms from scikit-learn. Using EWB, our multidimensional probability estimation was the fastest of all tested algorithms, while being only 1-2% less accurate than the best state-of-the-art methods found (decision trees and random forests). Our method was significantly more accurate than Naive Bayes, which assumes independent one-dimensional probabilities and/or densities. Also, our method was significantly more accurate and faster than logistic regression. This motivates further research on accuracy improvement and on IDBML with SQL code generation for big data and larger-than-memory datasets.
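The general idea of generating SQL from Python to build a multidimensional histogram with equal-width binning can be illustrated with a minimal sketch. It is not the authors' implementation: it uses plain f-strings rather than Jinja2 template macros so that it runs on the standard library alone, SQLite stands in for whatever database the paper targets, and the table `train`, the table `mdh`, the toy data, and all function names are hypothetical. It also assumes that classification picks the most frequent label in the query point's histogram cell, and it does not cover EQRB, which the abstract does not define.

```python
import sqlite3

def ewb_expr(col, lo, hi, n_bins):
    """Return a SQL expression mapping `col` to an equal-width bin index in 0..n_bins-1."""
    width = (hi - lo) / n_bins
    # CAST truncates to an integer; MIN/MAX clamp out-of-range values into the edge bins.
    return f"MAX(0, MIN({n_bins - 1}, CAST(({col} - {lo}) / {width} AS INTEGER)))"

def histogram_sql(table, features, label, ranges, n_bins):
    """Generate SQL that counts rows per (bin_1, ..., bin_d, label) cell."""
    binned = ", ".join(f"{ewb_expr(c, *ranges[c], n_bins)} AS {c}_bin" for c in features)
    keys = ", ".join(f"{c}_bin" for c in features)
    return (
        f"CREATE TABLE mdh AS "
        f"SELECT {keys}, {label}, COUNT(*) AS n "
        f"FROM (SELECT {binned}, {label} FROM {table}) "
        f"GROUP BY {keys}, {label}"
    )

def predict_sql(features, label, ranges, n_bins):
    """Generate a parameterized query for the most frequent label in a point's cell."""
    where = " AND ".join(f"{c}_bin = {ewb_expr('?', *ranges[c], n_bins)}" for c in features)
    return f"SELECT {label} FROM mdh WHERE {where} ORDER BY n DESC LIMIT 1"

# Toy training data (hypothetical), stored in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE train (x REAL, y REAL, label TEXT)")
rows = [(0.5, 0.5, "a"), (1.0, 1.5, "a"), (1.5, 0.2, "b"),
        (8.0, 9.0, "b"), (9.5, 8.5, "b"), (9.0, 9.9, "a")]
conn.executemany("INSERT INTO train VALUES (?, ?, ?)", rows)

features, n_bins = ["x", "y"], 4
# The per-feature value range is computed inside the database as well.
ranges = {c: conn.execute(f"SELECT MIN({c}), MAX({c}) FROM train").fetchone()
          for c in features}

# Training: one generated statement builds the whole histogram in SQL.
conn.execute(histogram_sql("train", features, "label", ranges, n_bins))

def predict(conn, point):
    """Classify `point` (values in feature order); None if its cell has no training rows."""
    row = conn.execute(predict_sql(features, "label", ranges, n_bins), point).fetchone()
    return row[0] if row else None

print(predict(conn, (1.0, 1.0)), predict(conn, (9.0, 9.0)), predict(conn, (5.0, 5.0)))
```

Because both the binning and the counting happen inside the generated SQL, the training data never has to be loaded into Python memory. The sketch interpolates identifiers directly into the SQL text, so it should only be used with trusted column and table names.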
