PeaTMOSS: Mining Pre-Trained Models in Open-Source Software
This dataset enables research into software engineering practices for AI developers; the contribution is incremental, as it builds on existing data collection efforts.
The authors address the limited understanding of software engineering behaviors and challenges in reusing pre-trained deep learning models (PTMs) by creating the PeaTMOSS dataset, which includes 281,638 PTMs, 27,270 open-source repositories that use PTMs, and a mapping between the two.
Developing and training deep learning models is expensive, so software engineers have begun to reuse pre-trained deep learning models (PTMs) and fine-tune them for downstream tasks. Despite the widespread use of PTMs, we know little about the corresponding software engineering behaviors and challenges. To enable the study of software engineering with PTMs, we present the PeaTMOSS dataset: Pre-Trained Models in Open-Source Software. PeaTMOSS has three parts: a snapshot of (1) 281,638 PTMs, (2) 27,270 open-source software repositories that use PTMs, and (3) a mapping between PTMs and the projects that use them. We challenge PeaTMOSS miners to discover software engineering practices around PTMs. A demo and link to the full dataset are available at: https://github.com/PurdueDualityLab/PeaTMOSS-Demos.
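To make the three-part structure concrete, the following minimal sketch builds a toy in-memory database with one table of models, one table of repositories, and a mapping table between them, then counts how many repositories reuse each model. The table names, column names, and rows are hypothetical and chosen only for illustration; they are not the PeaTMOSS schema, which should be taken from the demo repository linked above.

```python
import sqlite3


def build_toy_db():
    """Create a toy database with hypothetical tables (not the PeaTMOSS schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE model (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE repository (id INTEGER PRIMARY KEY, url TEXT);
        CREATE TABLE model_to_repository (
            model_id INTEGER REFERENCES model(id),
            repository_id INTEGER REFERENCES repository(id)
        );
        """
    )
    # Made-up rows: three models, two repositories, three reuse links.
    conn.executemany(
        "INSERT INTO model VALUES (?, ?)",
        [(1, "toy-bert"), (2, "toy-resnet"), (3, "toy-unused")],
    )
    conn.executemany(
        "INSERT INTO repository VALUES (?, ?)",
        [(1, "example.org/repo-a"), (2, "example.org/repo-b")],
    )
    conn.executemany(
        "INSERT INTO model_to_repository VALUES (?, ?)",
        [(1, 1), (1, 2), (2, 2)],
    )
    conn.commit()
    return conn


def reuse_counts(conn):
    """Return (model name, number of repositories using it), most reused first."""
    return conn.execute(
        """
        SELECT m.name, COUNT(x.repository_id) AS n_repos
        FROM model AS m
        LEFT JOIN model_to_repository AS x ON x.model_id = m.id
        GROUP BY m.id, m.name
        ORDER BY n_repos DESC, m.name
        """
    ).fetchall()


if __name__ == "__main__":
    for name, n_repos in reuse_counts(build_toy_db()):
        print(name, n_repos)
```

The LEFT JOIN keeps models that no repository uses, so a mining question such as "which PTMs are never reused?" can be answered from the same query by filtering on a count of zero.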