Model Lakes
This addresses the issue for practitioners in machine learning who need to efficiently find, differentiate, and understand models, but it is incremental as it builds on data lakes research.
The paper tackles the problem of managing and understanding large sets of deep learning models, which is challenging due to incomplete documentation and increasing model numbers, by introducing the concept of model lakes to formalize tasks like attribution, search, and benchmarking.
Given a set of deep learning models, it can be hard to find models appropriate to a task, understand the models, and characterize how models are different one from another. Currently, practitioners rely on manually-written documentation to understand and choose models. However, not all models have complete and reliable documentation. As the number of models increases, the challenges of finding, differentiating, and understanding models become increasingly crucial. Inspired from research on data lakes, we introduce the concept of model lakes. We formalize key model lake tasks, including model attribution, versioning, search, and benchmarking, and discuss fundamental research challenges in the management of large models. We also explore what data management techniques can be brought to bear on the study of large model management.