MLlib: Machine Learning in Apache Spark
MLlib addresses the need for scalable, accessible machine learning tools for Apache Spark users, though the contribution is incremental in that it builds on existing Spark infrastructure.
The authors introduce MLlib, a distributed machine learning library integrated into Apache Spark and designed to process large-scale data efficiently across a range of learning settings. The library has drawn contributions from over 140 developers and ships with extensive documentation to ease adoption.
Apache Spark is a popular open-source platform for large-scale data processing that is well-suited for iterative machine learning tasks. In this paper we present MLlib, Spark's open-source distributed machine learning library. MLlib provides efficient functionality for a wide range of learning settings and includes several underlying statistical, optimization, and linear algebra primitives. Shipped with Spark, MLlib supports several languages and provides a high-level API that leverages Spark's rich ecosystem to simplify the development of end-to-end machine learning pipelines. MLlib has experienced rapid growth due to its vibrant open-source community of over 140 contributors, and includes extensive documentation to support further growth and to let users quickly get up to speed.
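The abstract describes Spark as well-suited to iterative machine learning and mentions optimization primitives. As a rough illustration of the kind of computation involved, the sketch below expresses gradient descent for a one-variable linear model as a per-partition "map" followed by a "reduce", repeated once per iteration. It is plain Python and not MLlib code: the names `partitions`, `partition_gradient`, `combine`, and `fit` are hypothetical, and ordinary lists stand in for the partitions of a distributed dataset.

```python
from functools import reduce

# Toy data for y = 2*x + 1, split into two "partitions".
# Assumption: plain lists stand in for distributed partitions; this is not a Spark RDD.
partitions = [
    [(0.0, 1.0), (1.0, 3.0)],
    [(2.0, 5.0), (3.0, 7.0)],
]


def partition_gradient(rows, w, b):
    """Map step: sum the gradient of 0.5 * error**2 over one partition."""
    grad_w = 0.0
    grad_b = 0.0
    for x, y in rows:
        err = (w * x + b) - y
        grad_w += err * x
        grad_b += err
    return grad_w, grad_b, len(rows)


def combine(left, right):
    """Reduce step: element-wise sum of two partial (grad_w, grad_b, count) results."""
    return left[0] + right[0], left[1] + right[1], left[2] + right[2]


def fit(parts, step=0.1, iterations=2000):
    """Run gradient descent, doing one map/reduce pass over all partitions per iteration."""
    w = 0.0
    b = 0.0
    for _ in range(iterations):
        partials = (partition_gradient(p, w, b) for p in parts)
        grad_w, grad_b, n = reduce(combine, partials)
        # Average the summed gradient over all rows before taking a step.
        w -= step * grad_w / n
        b -= step * grad_b / n
    return w, b


w, b = fit(partitions)
```

Each iteration rereads the same data, which is why a platform that can keep a dataset in memory across passes is a natural fit for this style of algorithm. In a real deployment the map and reduce steps would run on the cluster rather than in a local loop.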