VeML: An End-to-End Machine Learning Lifecycle for Large-scale and High-dimensional Data
This work addresses efficiency and robustness issues in ML lifecycle management for practitioners handling large-scale, high-dimensional data, representing an incremental improvement over existing systems.
The paper tackles the high cost and accuracy degradation involved in building end-to-end machine learning lifecycles for large-scale, high-dimensional data. It introduces VeML, a version management system that transfers lifecycles from similar datasets and detects mismatches between training and testing data without labeled testing data. Experiments on real-world datasets of driving images and spatiotemporal sensor data show promising results.
An end-to-end machine learning (ML) lifecycle consists of many iterative processes, from data preparation and ML model design to model training and deployment of the trained model for inference. When building an end-to-end lifecycle for an ML problem, many ML pipelines must be designed and executed, producing a huge number of lifecycle versions. Therefore, this paper introduces VeML, a Version management system dedicated to the end-to-end ML Lifecycle. Our system tackles several crucial problems that other systems have not solved. First, we address the high cost of building an ML lifecycle, especially for large-scale and high-dimensional datasets. We solve this problem by transferring the lifecycle of similar datasets already managed in our system to the new training data. To find such datasets, we design a core-set-based algorithm that efficiently computes similarity for large-scale, high-dimensional data. Another critical issue is the degradation of model accuracy caused by differences between training data and testing data over the ML lifetime, which forces the lifecycle to be rebuilt. Our system helps detect this mismatch without requiring labels for the testing data and rebuilds the ML lifecycle for the new data version. To demonstrate our contributions, we conduct experiments on real-world, large-scale datasets of driving images and spatiotemporal sensor data and show promising results.
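To make the idea of core-set-based dataset similarity concrete, the following is a minimal sketch and not VeML's actual algorithm. The abstract does not specify how the core set is built or how similarity is measured, so the sketch assumes greedy k-center (farthest-first) selection and a symmetric mean nearest-neighbor distance between the two core sets. All function names and the parameter k are hypothetical.

```python
import math
import random


def k_center_greedy(points, k, seed=0):
    """Select up to k representatives by greedy k-center (farthest-first).

    Assumption: this stands in for the core-set construction, which the
    abstract does not describe.
    """
    rng = random.Random(seed)
    first = rng.randrange(len(points))
    selected = [first]
    # min_dist[i] is the distance from point i to its nearest selected center.
    min_dist = [math.dist(p, points[first]) for p in points]
    while len(selected) < min(k, len(points)):
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        if min_dist[nxt] == 0.0:
            break  # every remaining point duplicates a selected center
        selected.append(nxt)
        for i, p in enumerate(points):
            d = math.dist(p, points[nxt])
            if d < min_dist[i]:
                min_dist[i] = d
    return [points[i] for i in selected]


def mean_nearest_distance(a, b):
    """Average distance from each point in a to its nearest point in b."""
    return sum(min(math.dist(p, q) for q in b) for p in a) / len(a)


def coreset_distance(data_a, data_b, k=8):
    """Symmetric distance between two datasets, computed on core sets only.

    Smaller values mean more similar datasets. Cost depends on k rather
    than on the full dataset sizes once the core sets are selected.
    """
    core_a = k_center_greedy(data_a, k)
    core_b = k_center_greedy(data_b, k)
    return 0.5 * (
        mean_nearest_distance(core_a, core_b)
        + mean_nearest_distance(core_b, core_a)
    )
```

Under these assumptions, a new dataset would be compared against each managed dataset with `coreset_distance`, and the lifecycle of the closest one would become the candidate for transfer. For image or sensor data, the points would typically be feature vectors rather than raw inputs; that choice is also an assumption here.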