SE, Jul 10, 2019
Identifying Algorithm Names in Code Comments
Jakapong Klainongsuang, Yusuf Sulistyo Nugroho, Hideaki Hata et al.
Recent machine-learning-based tasks such as API sequence generation, comment generation, and document generation require large amounts of data. When software developers implement algorithms in code, we find that they often mention algorithm names in code comments. Code annotated with such algorithm names can be a valuable data source. In this paper, we propose an automatic method for identifying algorithm names. The key idea is to extract important N-grams that end with the word 'algorithm'. We also consider part-of-speech patterns to derive rules for identifying appropriate algorithm names. In our evaluation, the rules achieved high precision and recall (more than 0.70). We apply our rules to extract algorithm names from a large number of comments in active FLOSS projects written in seven programming languages (C, C++, Java, JavaScript, Python, PHP, and Ruby) and report the algorithm names most commonly mentioned in code comments.
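The key idea of the abstract above, collecting N-grams whose last word is 'algorithm', can be illustrated with a minimal sketch. This is not the authors' implementation: the tokenizer, the function names `candidate_algorithm_names` and `count_candidates`, and the `max_n` limit are assumptions made for illustration, and the part-of-speech rules mentioned in the abstract are omitted entirely.

```python
import re
from collections import Counter

# Assumed tokenizer: words that start with a letter and may contain
# digits, apostrophes, or hyphens. The paper's tokenization may differ.
TOKEN_RE = re.compile(r"[A-Za-z][A-Za-z0-9'\-]*")


def candidate_algorithm_names(comment, max_n=4):
    """Return all N-grams (2 <= N <= max_n) that end with 'algorithm'.

    Hypothetical helper: no part-of-speech filtering is applied, so
    noisy candidates such as 'the Dijkstra algorithm' are kept.
    """
    tokens = TOKEN_RE.findall(comment)
    candidates = []
    for i, token in enumerate(tokens):
        if token.lower() != "algorithm":
            continue
        for n in range(2, max_n + 1):
            start = i - n + 1
            if start < 0:
                break  # not enough preceding words for a longer N-gram
            candidates.append(" ".join(tokens[start:i + 1]))
    return candidates


def count_candidates(comments, max_n=4):
    """Count candidate N-grams over a collection of comments."""
    counts = Counter()
    for comment in comments:
        counts.update(candidate_algorithm_names(comment, max_n))
    return counts


if __name__ == "__main__":
    comments = [
        "Uses the Dijkstra algorithm to find paths",
        "Fallback: Dijkstra algorithm with a binary heap",
    ]
    for name, freq in count_candidates(comments, max_n=3).most_common():
        print(freq, name)
```

A rule-based filter on part-of-speech tags would then be needed to discard candidates that begin with determiners or verbs; which patterns to keep is exactly what the paper's rules address.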
SE, May 9, 2019
A Topological Analysis of Communication Channels for Knowledge Sharing in Contemporary GitHub Projects
Jirateep Tantisuwankul, Yusuf Sulistyo Nugroho, Raula Gaikovina Kula et al.
With over 28 million developers, the success of the GitHub collaborative platform is highlighted by the abundance of communication channels among contemporary software projects. According to the SECI model, knowledge takes two forms, and its sharing through communication channels can be described as externalization or combination. Platforms such as GitHub have revolutionized the way developers work, introducing new channels for sharing knowledge in the form of pull requests, issues, and wikis. It is unclear how these channels capture and share knowledge. In this research, our goal is to analyze these communication channels in GitHub. First, using the SECI model, we map how knowledge is shared through the communication channels. Then, in a large-scale topology analysis of seven library package projects (involving over 70 thousand projects), we extract insights into the different communication channels within GitHub. Using two research questions, we explore the evolution of the channels and their adoption by both popular and unpopular library package projects. Results show that (i) contemporary GitHub projects tend to adopt multiple communication channels, (ii) communication channels change over time, and (iii) communication channels are used both to capture new knowledge (i.e., externalization) and to update existing knowledge (i.e., combination).
SE, Oct 2, 2017
Extracting Insights from the Topology of the JavaScript Package Ecosystem
Nuttapon Lertwittayatrai, Raula Gaikovina Kula, Saya Onoue et al.
Software ecosystems have had a tremendous impact on computing and society, capturing the attention of businesses, researchers, and policy makers alike. Massive ecosystems like the JavaScript node package manager (npm) are evidence of how readily packages are available for use by software projects. Because of their high-dimensional and complex properties, the analysis of software ecosystems has been limited. In this paper, we leverage topological methods to visualize the high-dimensional datasets from a software ecosystem. Topological Data Analysis (TDA) is an emerging technique for analyzing high-dimensional datasets that enables us to study the shape of data. We generate the topology of the npm software ecosystem to uncover insights and extract patterns of existing libraries by studying its localities. Our real-world example reveals many interesting insights and patterns that describe the shape of a software ecosystem.