LG MLNov 5, 2018

Parallel training of linear models without compromising convergence

Nikolas Ioannou, Celestine Dünner, Kornilios Kourtis, Thomas Parnell

arXiv:1811.01564v21.51 citations

Originality Incremental advance

AI Analysis

This work addresses performance bottlenecks in machine learning training for practitioners using multi-core CPUs, though it is incremental as it builds on existing asynchronous parallel algorithms.

The paper tackles the problem of slow convergence in parallel training of linear models by optimizing system-level bottlenecks and introducing dynamic data partitioning, achieving up to 42x speedup compared to state-of-the-art implementations.

In this paper we analyze, evaluate, and improve the performance of training generalized linear models on modern CPUs. We start with a state-of-the-art asynchronous parallel training algorithm, identify system-level performance bottlenecks, and apply optimizations that improve data parallelism, cache line locality, and cache line prefetching of the algorithm. These modifications reduce the per-epoch run-time significantly, but take a toll on algorithm convergence in terms of the required number of epochs. To alleviate these shortcomings of our systems-optimized version, we propose a novel, dynamic data partitioning scheme across threads which allows us to approach the convergence of the sequential version. The combined set of optimizations result in a consistent bottom line speedup in convergence of up to 12x compared to the initial asynchronous parallel training algorithm and up to 42x, compared to state of the art implementations (scikit-learn and h2o) on a range of multi-core CPU architectures.

View on arXiv PDF

Similar