Diomidis Spinellis

h-index49

16papers

467citations

Novelty31%

AI Score41

Ranked #69,363 of 194,257 authors (top 36%)#706 in SE (top 23%)

16 Papers

1.7SESep 12, 2023Code

Commands as AI Conversations

Diomidis Spinellis

Developers and data scientists often struggle to write command-line inputs, even though graphical interfaces or tools like ChatGPT can assist. The solution? "ai-cli," an open-source system inspired by GitHub Copilot that converts natural language prompts into executable commands for various Linux command-line tools. By tapping into OpenAI's API, which allows interaction through JSON HTTP requests, "ai-cli" transforms user queries into actionable command-line instructions. However, integrating AI assistance across multiple command-line tools, especially in open source settings, can be complex. Historically, operating systems could mediate, but individual tool functionality and the lack of a unified approach have made centralized integration challenging. The "ai-cli" tool, by bridging this gap through dynamic loading and linking with each program's Readline library API, makes command-line interfaces smarter and more user-friendly, opening avenues for further enhancement and cross-platform applicability.

12.6SENov 17, 2022

Machine Learning for Software Engineering: A Tertiary Study

Zoe Kotti, Rafaila Galanopoulou, Diomidis Spinellis

Machine learning (ML) techniques increase the effectiveness of software engineering (SE) lifecycle activities. We systematically collected, quality-assessed, summarized, and categorized 83 reviews in ML for SE published between 2009-2022, covering 6,117 primary studies. The SE areas most tackled with ML are software quality and testing, while human-centered areas appear more challenging for ML. We propose a number of ML for SE research challenges and actions including: conducting further empirical validation and industrial studies on ML; reconsidering deficient SE methods; documenting and automating data collection and pipeline processes; reexamining how industrial practitioners distribute their proprietary data; and implementing incremental ML approaches.

5.3CRMar 31

An Empirical Comparison of Security and Privacy Characteristics of Android Messaging Apps

Ioannis Karyotakis, Foivos Timotheos Proestakis, Evangelos Talos et al.

Mobile messaging apps are a fundamental communication infrastructure, used by billions of people every day to share information, including sensitive data. Security and Privacy are thus critical concerns for such applications. Although the cryptographic protocols prevalent in messaging apps are generally well studied, other relevant implementation characteristics of such apps, such as their software architecture, permission use, and network-related runtime behavior, have not received enough attention. In this paper, we present a methodology for comparing implementation characteristics of messaging applications by employing static and dynamic analysis under reproducible scenarios to identify discrepancies with potential security and privacy implications. We apply this methodology to study the Android clients of the Meta Messenger, Signal, and Telegram apps. Our main findings reveal discrepancies in application complexity, attack surface, and network behavior. Statically, Messenger presents the largest attack surface and the highest number of static analysis warnings, while Telegram requests the most dangerous permissions. In contrast, Signal consistently demonstrates a minimalist design with the fewest dependencies and dangerous permissions. Dynamically, these differences are reflected in network activity; Messenger is by far the most active, exhibiting persistent background communication, whereas Signal is the least active. Furthermore, our analysis shows that all applications properly adhere to the Android permission model, with no evidence of unauthorized data access.

5.3SEMay 14, 2020Code

Identifying Bugs in Make and JVM-Oriented Builds

Thodoris Sotiropoulos, Stefanos Chaliasos, Dimitris Mitropoulos et al.

Incremental and parallel builds are crucial features of modern build systems. Parallelism enables fast builds by running independent tasks simultaneously, while incrementality saves time and computing resources by processing the build operations that were affected by a particular code change. Writing build definitions that lead to error-free incremental and parallel builds is a challenging task. This is mainly because developers are often unable to predict the effects of build operations on the file system and how different build operations interact with each other. Faulty build scripts may seriously degrade the reliability of automated builds, as they cause build failures, and non-deterministic and incorrect build results. To reason about arbitrary build executions, we present buildfs, a generally-applicable model that takes into account the specification (as declared in build scripts) and the actual behavior (low-level file system operation) of build operations. We then formally define different types of faults related to incremental and parallel builds in terms of the conditions under which a file system operation violates the specification of a build operation. Our testing approach, which relies on the proposed model, analyzes the execution of single full build, translates it into buildfs, and uncovers faults by checking for corresponding violations. We evaluate the effectiveness, efficiency, and applicability of our approach by examining hundreds of Make and Gradle projects. Notably, our method is the first to handle Java-oriented build systems. The results indicate that our approach is (1) able to uncover several important issues (245 issues found in 45 open-source projects have been confirmed and fixed by the upstream developers), and (2) orders of magnitude faster than a state-of-the-art tool for Make builds.

15.8SEApr 15, 2019Code

Semantic Source Code Models Using Identifier Embeddings

Vasiliki Efstathiou, Diomidis Spinellis

The emergence of online open source repositories in the recent years has led to an explosion in the volume of openly available source code, coupled with metadata that relate to a variety of software development activities. As an effect, in line with recent advances in machine learning research, software maintenance activities are switching from symbolic formal methods to data-driven methods. In this context, the rich semantics hidden in source code identifiers provide opportunities for building semantic representations of code which can assist tasks of code search and reuse. To this end, we deliver in the form of pretrained vector space models, distributed code representations for six popular programming languages, namely, Java, Python, PHP, C, C++, and C#. The models are produced using fastText, a state-of-the-art library for learning word representations. Each model is trained on data from a single programming language; the code mined for producing all models amounts to over 13.000 repositories. We indicate dissimilarities between natural language and source code, as well as variations in coding conventions in between the different programming languages we processed. We describe how these heterogeneities guided the data preprocessing decisions we took and the selection of the training parameters in the released models. Finally, we propose potential applications of the models and discuss limitations of the models.

9.6SEMar 6, 2018Code

Code Review Comments: Language Matters

Vasiliki Efstathiou, Diomidis Spinellis

Recent research provides evidence that effective communication in collaborative software development has significant impact on the software development lifecycle. Although related qualitative and quantitative studies point out textual characteristics of well-formed messages, the underlying semantics of the intertwined linguistic structures still remain largely misinterpreted or ignored. Especially, regarding quality of code reviews the importance of thorough feedback, and explicit rationale is often mentioned but rarely linked with related linguistic features. As a first step towards addressing this shortcoming, we propose grounding these studies on theories of linguistics. We particularly focus on linguistic structures of coherent speech and explain how they can be exploited in practice. We reflect on related approaches and examine through a preliminary study on four open source projects, possible links between existing findings and the directions we suggest for detecting textual features of useful code reviews.

3.6SEDec 23, 2021

Software Engineering Education Knowledge Versus Industrial Needs

Georgios Liargkovas, Angeliki Papadopoulou, Zoe Kotti et al.

Contribution: Determine and analyze the gap between software practitioners' education outlined in the 2014IEEE/ACM Software Engineering Education Knowledge (SEEK) and industrial needs pointed by Wikipedia articles referenced in Stack Overflow (SO) posts. Background: Previous work has uncovered deficiencies in the coverage of computer fundamentals, people skills, software processes, and human-computer interaction, suggesting rebalancing. Research Questions: 1) To what extent are developers' needs, in terms of Wikipedia articles referenced in SO posts, covered by the SEEK knowledge units? 2) How does the popularity of Wikipedia articles relate to their SEEK coverage? 3) What areas of computing knowledge can be better covered by the SEEK knowledge units? 4) Why are Wikipedia articles covered by the SEEK knowledge units cited on SO? Methodology: Wikipedia articles were systematically collected from SO posts. The most cited were manually mapped to the SEEK knowledge units, assessed according to their degree of coverage. Articles insufficiently covered by the SEEK were classified by hand using the 2012 ACM Computing Classification System. A sample of posts referencing sufficiently covered articles was manually analyzed. A survey was conducted on software practitioners to validate the study findings. Findings: SEEK appears to cover sufficiently computer science fundamentals, software design and mathematical concepts, but less so areas like the World Wide Web, software engineering components, and computer graphics. Developers seek advice, best practices and explanations about software topics, and code review assistance. Future SEEK models and the computing education could dive deeper in information systems, design, testing, security, and soft skills.

12.0SENov 29, 2021

Standing on Shoulders or Feet? An Extended Study on the Usage of the MSR Data Papers

Zoe Kotti, Konstantinos Kravvaritis, Konstantina Dritsa et al.

Context: The establishment of the Mining Software Repositories (MSR) data showcase conference track has encouraged researchers to provide data sets as a basis for further empirical studies. Objective: Examine the usage of data papers published in the MSR proceedings in terms of use frequency, users, and use purpose. Method: Data track papers were collected from the MSR data showcase track and through the manual inspection of older MSR proceedings. The use of data papers was established through manual citation searching followed by reading the citing studies and dividing them into strong and weak citations. Contrary to weak, strong citations truly use the data set of a data paper. Data papers were then manually clustered based on their content, whereas their strong citations were classified by hand according to the knowledge areas of the Guide to the Software Engineering Body of Knowledge. A survey study on 108 authors and users of data papers provided further insights regarding motivation and effort in data paper production, encouraging and discouraging factors in data set use, and future desired direction regarding data papers. Results: We found that 65% of the data papers have been used in other studies, with a long-tail distribution in the number of strong citations. Weak citations to data papers usually refer to them as an example. MSR data papers are cited in total less than other MSR papers. A considerable number of the strong citations stem from the teams that authored the data papers. Publications providing Version Control System (VCS) primary and derived data are the most frequent data papers and the most often strongly cited ones. Enhanced developer data papers are the least common ones, and the second least frequently strongly cited. Data paper authors tend to gather data in the context of other research. [...]

11.7PLFeb 28, 2021Code

PyCG: Practical Call Graph Generation in Python

Vitalis Salis, Thodoris Sotiropoulos, Panos Louridas et al.

Call graphs play an important role in different contexts, such as profiling and vulnerability propagation analysis. Generating call graphs in an efficient manner can be a challenging task when it comes to high-level languages that are modular and incorporate dynamic features and higher-order functions. Despite the language's popularity, there have been very few tools aiming to generate call graphs for Python programs. Worse, these tools suffer from several effectiveness issues that limit their practicality in realistic programs. We propose a pragmatic, static approach for call graph generation in Python. We compute all assignment relations between program identifiers of functions, variables, classes, and modules through an inter-procedural analysis. Based on these assignment relations, we produce the resulting call graph by resolving all calls to potentially invoked functions. Notably, the underlying analysis is designed to be efficient and scalable, handling several Python features, such as modules, generators, function closures, and multiple inheritance. We have evaluated our prototype implementation, which we call PyCG, using two benchmarks: a micro-benchmark suite containing small Python programs and a set of macro-benchmarks with several popular real-world Python packages. Our results indicate that PyCG can efficiently handle thousands of lines of code in less than a second (0.38 seconds for 1k LoC on average). Further, it outperforms the state-of-the-art for Python in both precision and recall: PyCG achieves high rates of precision ~99.2%, and adequate recall ~69.9%. Finally, we demonstrate how PyCG can aid dependency impact analysis by showcasing a potential enhancement to GitHub's "security advisory" notification service using a real-world example.

8.9SEDec 22, 2020

Do We Need Improved Code Quality Metrics?

Tushar Sharma, Diomidis Spinellis

The software development community has been using code quality metrics for the last five decades. Despite their wide adoption, code quality metrics have attracted a fair share of criticism. In this paper, first, we carry out a qualitative exploration by surveying software developers to gauge their opinions about current practices and potential gaps with the present set of metrics. We identify deficiencies including lack of soundness, i.e., the ability of a metric to capture a notion accurately as promised by the metric, lack of support for assessing software architecture quality, and insufficient support for assessing software testing and infrastructure. In the second part of the paper, we focus on one specific code quality metric-LCOM as a case study to explore opportunities towards improved metrics. We evaluate existing LCOM algorithms qualitatively and quantitatively to observe how closely they represent the concept of cohesion. In this pursuit, we first create eight diverse cases that any LCOM algorithm must cover and obtain their cohesion levels by a set of experienced developers and consider them as a ground truth. We show that the present set of LCOM algorithms do poorly w.r.t. these cases. To bridge the identified gap, we propose a new approach to compute LCOM and evaluate the new approach with the ground truth. We also show, using a quantitative analysis using more than 90 thousand types belonging to 261 high-quality Java repositories, the present set of methods paint a very inaccurate and misleading picture of class cohesion. We conclude that the current code quality metrics in use suffer from various deficiencies, presenting ample opportunities for the research community to address the gaps.

8.9SENov 16, 2020

The Software Heritage Graph Dataset: Large-scale Analysis of Public Software Development History

Antoine Pietri, Diomidis Spinellis, Stefano Zacchiroli

Software Heritage is the largest existing public archive of software source code and accompanying development history. It spans more than five billion unique source code files and one billion unique commits , coming from more than 80 million software projects. These software artifacts were retrieved from major collaborative development platforms (e.g., GitHub, GitLab) and package repositories (e.g., PyPI, Debian, NPM), and stored in a uniform representation linking together source code files, directories, commits, and full snapshots of version control systems (VCS) repositories as observed by Software Heritage during periodic crawls. This dataset is unique in terms of accessibility and scale, and allows to explore a number of research questions on the long tail of public software development, instead of solely focusing on ''most starred'' repositories as it often happens.

30.5SEOct 7, 2020Code

Empirical Standards for Software Engineering Research

Paul Ralph, Nauman bin Ali, Sebastian Baltes et al.

Empirical Standards are natural-language models of a scientific community's expectations for a specific kind of study (e.g. a questionnaire survey). The ACM SIGSOFT Paper and Peer Review Quality Initiative generated empirical standards for research methods commonly used in software engineering. These living documents, which should be continuously revised to reflect evolving consensus around research best practices, will improve research quality and make peer review more effective, reliable, transparent and fair.

13.8SEFeb 6, 2020

A Dataset for GitHub Repository Deduplication: Extended Description

Diomidis Spinellis, Zoe Kotti, Audris Mockus

GitHub projects can be easily replicated through the site's fork process or through a Git clone-push sequence. This is a problem for empirical software engineering, because it can lead to skewed results or mistrained machine learning models. We provide a dataset of 10.6 million GitHub projects that are copies of others, and link each record with the project's ultimate parent. The ultimate parents were derived from a ranking along six metrics. The related projects were calculated as the connected components of an 18.2 million node and 12 million edge denoised graph created by directing edges to ultimate parents. The graph was created by filtering out more than 30 hand-picked and 2.3 million pattern-matched clumping projects. Projects that introduced unwanted clumping were identified by repeatedly visualizing shortest path distances between unrelated important projects. Our dataset identified 30 thousand duplicate projects in an existing popular reference dataset of 1.8 million projects. An evaluation of our dataset against another created independently with different methods found a significant overlap, but also differences attributed to the operational definition of what projects are considered as related.

5.0SEMay 27, 2019

Detecting Missing Dependencies and Notifiers in Puppet Programs

Thodoris Sotiropoulos, Dimitris Mitropoulos, Diomidis Spinellis

Puppet is a popular computer system configuration management tool. It provides abstractions that enable administrators to setup their computer systems declaratively. Its use suffers from two potential pitfalls. First, if ordering constraints are not specified whenever an abstraction depends on another, the non-deterministic application of abstractions can lead to race conditions. Second, if a service is not tied to its resources through notification constructs, the system may operate in a stale state whenever a resource gets modified. Such faults can degrade a computing infrastructure's availability and functionality. We have developed an approach that identifies these issues through the analysis of a Puppet program and its system call trace. Specifically, we present a formal model for traces, which allows us to capture the interactions of Puppet abstractions with the file system. By analyzing these interactions we identify (1) abstractions that are related to each other (e.g., operate on the same file), and (2) abstractions that should act as notifiers so that changes are correctly propagated. We then check the relationships from the trace's analysis against the program's dependency graph: a representation containing all the ordering constraints and notifications declared in the program. If a mismatch is detected, our system reports a potential fault. We have evaluated our method on a large set of Puppet modules, and discovered 57 previously unknown issues in 30 of them. Benchmarking further shows that our approach can analyze in minutes real-world configurations with a magnitude measured in thousands of lines and millions of system calls.

16.5SEApr 5, 2019

On the Feasibility of Transfer-learning Code Smells using Deep Learning

Tushar Sharma, Vasiliki Efstathiou, Panos Louridas et al.

Context: A substantial amount of work has been done to detect smells in source code using metrics-based and heuristics-based methods. Machine learning methods have been recently applied to detect source code smells; however, the current practices are considered far from mature. Objective: First, explore the feasibility of applying deep learning models to detect smells without extensive feature engineering, just by feeding the source code in tokenized form. Second, investigate the possibility of applying transfer-learning in the context of deep learning models for smell detection. Method: We use existing metric-based state-of-the-art methods for detecting three implementation smells and one design smell in C# code. Using these results as the annotated gold standard, we train smell detection models on three different deep learning architectures. These architectures use Convolution Neural Networks (CNNs) of one or two dimensions, or Recurrent Neural Networks (RNNs) as their principal hidden layers. For the first objective of our study, we perform training and evaluation on C# samples, whereas for the second objective, we train the models from C# code and evaluate the models over Java code samples. We perform the experiments with various combinations of hyper-parameters for each model. Results: We find it feasible to detect smells using deep learning methods. Our comparative experiments find that there is no clearly superior method between CNN-1D and CNN-2D. We also observe that performance of the deep learning models is smell-specific. Our transfer-learning experiments show that transfer-learning is definitely feasible for implementation smells with performance comparable to that of direct-learning. This work opens up a new paradigm to detect code smells by transfer-learning especially for the programming languages where the comprehensive code smell detection tools are not available.

8.5SEMar 10, 2019

Does Unit-Tested Code Crash? A Case Study of Eclipse

Efstathia Chioteli, Ioannis Batas, Diomidis Spinellis

Context: Software development projects increasingly adopt unit testing as a way to identify and correct program faults early in the construction process. Code that is unit tested should therefore have fewer failures associated with it. Objective: Compare the number of field failures arising in unit tested code against those arising in code that has not been unit tested. Method: We retrieved 2,083,979 crash incident reports associated with the Eclipse integrated development environment project, and processed them to obtain a set of 126,026 unique program failure stack traces associated with a specific popular release. We then run the JaCoCo code test coverage analysis on the same release, obtaining results on the line, instruction, and branch-level coverage of 216,392 methods. We also extracted from the source code the classes that are linked to a corresponding test class so as to limit test code coverage results to 1,267 classes with actual tests. Finally, we correlated unit tests with failures at the level of 9,523 failing tested methods. Results: Unit-tested code does not appear to be associated with fewer failures. Conclusion: Unit testing on its own may not be a sufficient method for preventing program failures.