Data Virtualization for Machine Learning
This work addresses data management challenges for ML teams that run multiple concurrent workflows, but its contribution is incremental because it builds on existing data virtualization concepts.
The paper tackles the problem of managing intermediate data in concurrent machine learning workflows by designing and implementing a data virtualization service. The service currently supports six ML applications, each with more than one workflow, and allows the number of applications and workflows to grow.
Machine learning (ML) teams commonly run multiple concurrent ML workflows for different applications. Each workflow typically involves many experiments, iterations, and collaborative activities, and commonly takes months, sometimes years, from initial data wrangling to model deployment. As a result, an organization must store, process, and maintain a large amount of intermediate data. \emph{Data virtualization} thus becomes a critical technology in any infrastructure that serves ML workflows. In this paper, we present the design and implementation of a data virtualization service, focusing on its service architecture and service operations. The infrastructure currently supports six ML applications, each with more than one ML workflow. The data virtualization service allows the number of applications and workflows to grow in the coming years.