Fast Factorized Learning: Powered by In-Memory Database Systems
This work is incremental, improving the efficiency of machine learning pipelines for data scientists by leveraging modern database engines.
This paper tackles the problem of accelerating factorized learning by implementing it on in-memory database systems, reporting a 70% performance gain over non-factorized learning and a 100x speedup compared to disk-based systems.
Learning models over factorized joins avoids redundant computations by identifying and pre-computing shared cofactors. Previous work has investigated the performance gain of computing cofactors on traditional disk-based database systems. Because no code was published, those experiments could not be reproduced on in-memory database systems. This work describes an implementation of cofactor computation for in-database factorized learning. We benchmark our open-source implementation for learning linear regression on factorized joins with PostgreSQL as a disk-based database system and HyPer as an in-memory engine. The evaluation shows that factorized learning on in-memory database systems achieves a performance gain of 70% over non-factorized learning and a speedup by a factor of 100 compared to disk-based database systems. Thus, modern database engines can contribute to the machine learning pipeline by pre-computing aggregates prior to data extraction to accelerate training.
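
To illustrate the idea of pre-computing shared cofactors, the following is a minimal Python sketch, not the paper's implementation. It assumes two hypothetical relations, R(key, x) and S(key, y), and compares two ways of obtaining the sums needed for linear regression over their join: aggregating after materializing the join, and aggregating each relation per key first and combining the partial results. The relation layout, function names, and sample data are illustrative assumptions.

```python
from collections import defaultdict

# Names of the cofactors collected over the join of R(key, x) and S(key, y).
COFACTORS = ("n", "x", "y", "xx", "xy", "yy")


def cofactors_naive(R, S):
    """Materialize every joined tuple, then aggregate (non-factorized)."""
    agg = dict.fromkeys(COFACTORS, 0.0)
    for kr, x in R:
        for ks, y in S:
            if kr == ks:
                agg["n"] += 1
                agg["x"] += x
                agg["y"] += y
                agg["xx"] += x * x
                agg["xy"] += x * y
                agg["yy"] += y * y
    return agg


def per_key_aggregates(rel):
    """Per join key: [count, sum of values, sum of squared values]."""
    out = defaultdict(lambda: [0, 0.0, 0.0])
    for key, value in rel:
        entry = out[key]
        entry[0] += 1
        entry[1] += value
        entry[2] += value * value
    return out


def cofactors_factorized(R, S):
    """Aggregate each relation once, then combine per key without joining."""
    r, s = per_key_aggregates(R), per_key_aggregates(S)
    agg = dict.fromkeys(COFACTORS, 0.0)
    for key in r.keys() & s.keys():  # only keys present on both sides join
        (cr, sx, sxx), (cs, sy, syy) = r[key], s[key]
        agg["n"] += cr * cs      # number of joined tuples for this key
        agg["x"] += sx * cs      # each x repeats once per matching S tuple
        agg["y"] += cr * sy      # each y repeats once per matching R tuple
        agg["xx"] += sxx * cs
        agg["xy"] += sx * sy     # sum of products factorizes per key
        agg["yy"] += cr * syy
    return agg


# Hypothetical sample data: key 1 joins 2 x 2 tuples, key 2 joins 1 x 1,
# and key 3 has no partner in R.
R = [(1, 1.0), (1, 2.0), (2, 3.0)]
S = [(1, 10.0), (1, 20.0), (2, 30.0), (3, 5.0)]

assert cofactors_naive(R, S) == cofactors_factorized(R, S)
```

In this sketch the non-factorized variant does work proportional to the size of the join result, whereas the factorized variant scans each relation once and then touches each join key once. A gradient-descent or closed-form solver for linear regression could then operate on the returned cofactors alone, without revisiting the joined tuples.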