Splash: User-friendly Programming Interface for Parallelizing Stochastic Algorithms
Splash offers developers a practical way to scale stochastic algorithms on distributed systems, though it builds incrementally on existing platforms such as Apache Spark.
The authors address the challenge of parallelizing stochastic algorithms for machine learning and optimization with Splash, a framework that pairs a user-friendly programming interface with an execution engine that automatically parallelizes sequential algorithms. In experiments on logistic regression, collaborative filtering and topic modeling, Splash achieved order-of-magnitude speedups over single-thread algorithms and over state-of-the-art Spark implementations.
Stochastic algorithms are efficient approaches to solving machine learning and optimization problems. In this paper, we propose a general framework called Splash for parallelizing stochastic algorithms on multi-node distributed systems. Splash consists of a programming interface and an execution engine. Using the programming interface, the user develops sequential stochastic algorithms without having to consider any details of distributed computing. The algorithm is then automatically parallelized by a communication-efficient execution engine. We provide theoretical justifications for the optimal rate of convergence for parallelizing stochastic gradient descent. Splash is built on top of Apache Spark. The real-data experiments on logistic regression, collaborative filtering and topic modeling verify that Splash yields order-of-magnitude speedup over single-thread stochastic algorithms and over state-of-the-art implementations on Spark.
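The division of labor described in the abstract, where the user writes a sequential update and an engine handles the parallelization, can be illustrated with a minimal Python sketch. It is not Splash's API or its actual combination strategy: it swaps in plain parameter averaging across data partitions, and every name in it (sgd_pass, parallel_round, make_data, accuracy) is hypothetical. The partitions are processed in a loop here; in a distributed setting each would presumably run on its own worker.

```python
import math
import random


def sgd_pass(weights, samples, lr=0.1):
    """User-written part: sequential SGD for logistic regression on one partition."""
    w = list(weights)
    for x, y in samples:
        z = sum(wi * xi for wi, xi in zip(w, x))
        p = 1.0 / (1.0 + math.exp(-z))
        # Gradient of the log loss for a label y in {0, 1} is (p - y) * x.
        for i, xi in enumerate(x):
            w[i] -= lr * (p - y) * xi
    return w


def parallel_round(weights, partitions, lr=0.1):
    """Engine stand-in (assumption): run the sequential pass on every partition
    from the same starting point, then average the resulting weight vectors."""
    local = [sgd_pass(weights, part, lr) for part in partitions]
    return [sum(ws) / len(local) for ws in zip(*local)]


def make_data(n, seed=0):
    """Synthetic, linearly separable data; the last feature is a constant bias term."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = [rng.uniform(-1, 1), rng.uniform(-1, 1), 1.0]
        y = 1 if 2.0 * x[0] - 1.0 * x[1] > 0 else 0
        data.append((x, y))
    return data


def accuracy(weights, samples):
    """Fraction of samples whose predicted class matches the label."""
    correct = 0
    for x, y in samples:
        z = sum(wi * xi for wi, xi in zip(weights, x))
        correct += int((z > 0) == (y == 1))
    return correct / len(samples)


data = make_data(2000)
partitions = [data[i::4] for i in range(4)]  # four simulated workers
w = [0.0, 0.0, 0.0]
for _ in range(5):  # five synchronization rounds
    w = parallel_round(w, partitions)
```

Note that sgd_pass contains no distributed logic at all, which is the property the abstract emphasizes. Averaging is only the simplest way to merge local updates; the paper's own engine and its convergence analysis should be consulted for the strategy Splash actually uses.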