Mahmoud Jahanshahi

h-index: 4

4 papers

49 citations

4 Papers

7.0 · SE · Jun 22 · Code

Ensuring Open Source Integrity: The Intersection of Copy-Based Reuse and License Compliance

Mahmoud Jahanshahi, Bogdan Vasilescu, Audris Mockus

Like other creative work, source code is protected by copyright. The owner can license the work, e.g., to permit copying and other kinds of use, and can even start legal proceedings against license violators. However, source code can be reused in subtle ways, e.g., via copying without explicit package manager dependencies, making it hard to reason about potential license noncompliance. Using the World of Code infrastructure, which approximates the entirety of open source software, we create a copy-based code reuse network mapping direct copying across projects, and use it to quantify the extent of potential license noncompliance across the entire open source ecosystem. In addition, we estimate regression models to understand whether code copying is affected by the origin project's license, and, if so, how it varies with other project characteristics. We find that code in repositories with permissive licenses, such as MIT and Apache, shows higher likelihood of reuse across programming languages. In contrast, copyleft licenses, like the GPL, exhibit mixed effects. Public domain licenses, despite their aim of allowing unrestricted use, are associated with lower likelihood of copy-based reuse. Widespread potential license noncompliance appears to accompany copy-based reuse, with 39.4% of project combinations at potential noncompliance risk, particularly when licenses are unclear or absent. Our findings reveal that only 2.43% of reuse detected through the copy-based network was discoverable via dependency analysis, highlighting the limitations of existing dependency-tracking tools in capturing copy-based reuse.

14.9 · SE · Jan 5, 2025 · Code

Cracks in The Stack: Hidden Vulnerabilities and Licensing Risks in LLM Pre-Training Datasets

Mahmoud Jahanshahi, Audris Mockus · meta-ai

A critical part of creating code suggestion systems is the pre-training of Large Language Models on vast amounts of source code and natural language text, often of questionable origin or quality. This may contribute to the presence of bugs and vulnerabilities in code generated by LLMs. While efforts to identify bugs at or after code generation exist, it is preferable to pre-train or fine-tune LLMs on curated, high-quality, and compliant datasets. The need for vast amounts of training data necessitates that such curation be automated, minimizing human intervention. We propose an automated source code autocuration technique that leverages the complete version history of open-source software projects to improve the quality of training data. The approach uses the version history of all OSS projects to identify training data samples that have been modified in at least one OSS project, and to pinpoint the subset of samples that include fixes for bugs or vulnerabilities. We evaluate this method using The Stack v2 dataset, and find that 17% of the code versions in the dataset have newer versions, with 17% of those representing bug fixes, including 2.36% addressing known CVEs. The deduplicated version of The Stack v2 still includes blobs vulnerable to 6,947 known CVEs. Furthermore, 58% of the blobs in the dataset were never modified after creation, suggesting they likely represent software with minimal or no use. Misidentified blob origins present an additional challenge, as they lead to the inclusion of non-permissively licensed code, raising serious compliance concerns. By addressing these issues, the training of new models can avoid perpetuating buggy code patterns or license violations. We expect our results to inspire process improvements for automated data curation, with the potential to enhance the reliability of outputs generated by AI tools.

6.4 · SE · Mar 22, 2021 · Code

Building the Collaboration Graph of Open-Source Software Ecosystem

Elena Lyulina, Mahmoud Jahanshahi

The Open-Source Software community has become the center of attention for many researchers, who are investigating various aspects of collaboration in this extremely large ecosystem. Due to its size, it is difficult to grasp whether or not it has structure, and if so, what that structure may be. Our hackathon project aims to facilitate the understanding of the developer collaboration structure and the relationships among projects, based on the bipartite graph of developers and the projects they contribute to. It does so by providing an interactive collaboration graph of this ecosystem, using data obtained from the World of Code (WoC) infrastructure. Our attempts to visualize the entirety of projects and developers were stymied by the inability of the layout and visualization tools to process the exceedingly large scale of the full graph. We used WoC to filter the nodes (developers and projects) and edges (developer contributions to a project), which reduced the scale of the graph enough to make it amenable to interactive visualization, and published the resulting visualizations. We plan to apply hierarchical approaches to incorporate the entire data in the interactive visualizations, and also to evaluate the utility of such visualizations for several tasks.

2.7 · SE · Jun 22

The Prevalence and Impact of Licenses in Open Software Projects

Mahmoud Jahanshahi, Bogdan Vasilescu, Audris Mockus

The terms under which publicly available source code can be used are dictated by its license. The license (or its absence), in turn, affects what code the project may reuse and how its code can be (re)used, and may also affect external participation and the overall activity of the project. We aim to better understand the general state of license distribution, both overall and within language ecosystems, and to investigate whether license changes are associated with noticeable variations in project output. To accomplish that, we identify licenses and license types for over 100M software projects and find that most do not contain any license, that permissive licenses represent the bulk of licenses, and that permissive licensing represents an increasing proportion of all licenses over time. Restrictive licenses are more likely to be retained, however. There is great variation among language ecosystems, with C-language projects strongly favoring restrictive licenses. The analysis of license change impact, which compares activity within one year of the adoption of the initial and final licenses, shows that the effect of a change from a restrictive to a permissive license varies with the ecosystem. C-language ecosystems show reduced activity, while Python shows increased activity, after a restrictive-to-permissive license transition. Our results demonstrate dramatic changes in license type prevalence over time and show that license changes may have opposite effects depending on the language ecosystem.