A Scalable and Cloud-Native Hyperparameter Tuning System
This system addresses the needs of both users and administrators in machine learning workflows, offering multi-tenancy, scalability, fault-tolerance, and extensibility. Its contribution is incremental, however, since it builds on existing hyperparameter tuning systems.
The paper tackles the problem of hyperparameter tuning by introducing Katib, a scalable and cloud-native system that is framework-agnostic and production-ready, demonstrating its advantages through experimental results and real-world use cases.
In this paper, we introduce Katib: a scalable, cloud-native, and production-ready hyperparameter tuning system that is agnostic to the underlying machine learning framework. Although multiple hyperparameter tuning systems are available, this is the first that caters to the needs of both the users and the administrators of the system. We present the motivation and design of the system and contrast it with existing hyperparameter tuning systems, especially in terms of multi-tenancy, scalability, fault-tolerance, and extensibility. Katib can be deployed on local machines or hosted as a service in on-premise data centers or in private/public clouds. We demonstrate the advantages of our system using experimental results as well as real-world production use cases. Katib has active contributors from multiple companies and is open-sourced at \emph{https://github.com/kubeflow/katib} under the Apache 2.0 license.