7.4SEMar 2, 2022
Code Smells in Machine Learning SystemsJiri Gesi, Siqi Liu, Jiawei Li et al.
As Deep learning (DL) systems continuously evolve and grow, assuring their quality becomes an important yet challenging task. Compared to non-DL systems, DL systems have more complex team compositions and heavier data dependency. These inherent characteristics would potentially cause DL systems to be more vulnerable to bugs and, in the long run, to maintenance issues. Code smells are empirically tested as efficient indicators of non-DL systems. Therefore, we took a step forward into identifying code smells, and understanding their impact on maintenance in this comprehensive study. This is the first study on investigating code smells in the context of DL software systems, which helps researchers and practitioners to get a first look at what kind of maintenance modification made and what code smells developers have been dealing with. Our paper has three major contributions. First, we comprehensively investigated the maintenance modifications that have been made by DL developers via studying the evolution of DL systems, and we identified nine frequently occurred maintenance-related modification categories in DL systems. Second, we summarized five code smells in DL systems. Third, we validated the prevalence, and the impact of our newly identified code smells through a mixture of qualitative and quantitative analysis. We found that our newly identified code smells are prevalent and impactful on the maintenance of DL systems from the developer's perspective.
3.1SEOct 19, 2023
Automated Repair of Declarative Software Specifications in the Era of Large Language ModelsMd Rashedul Hasan, Jiawei Li, Iftekhar Ahmed et al.
The growing adoption of declarative software specification languages, coupled with their inherent difficulty in debugging, has underscored the need for effective and automated repair techniques applicable to such languages. Researchers have recently explored various methods to automatically repair declarative software specifications, such as template-based repair, feedback-driven iterative repair, and bounded exhaustive approaches. The latest developments in large language models provide new opportunities for the automatic repair of declarative specifications. In this study, we assess the effectiveness of utilizing OpenAI's ChatGPT to repair software specifications written in the Alloy declarative language. Unlike imperative languages, specifications in Alloy are not executed but rather translated into logical formulas and evaluated using backend constraint solvers to identify specification instances and counterexamples to assertions. Our evaluation focuses on ChatGPT's ability to improve the correctness and completeness of Alloy declarative specifications through automatic repairs. We analyze the results produced by ChatGPT and compare them with those of leading automatic Alloy repair methods. Our study revealed that while ChatGPT falls short in comparison to existing techniques, it was able to successfully repair bugs that no other technique could address. Our analysis also identified errors in ChatGPT's generated repairs, including improper operator usage, type errors, higher-order logic misuse, and relational arity mismatches. Additionally, we observed instances of hallucinations in ChatGPT-generated repairs and inconsistency in its results. Our study provides valuable insights for software practitioners, researchers, and tool builders considering ChatGPT for declarative specification repairs.
1.0CLApr 4, 2024
The Death of Feature Engineering? BERT with Linguistic Features on SQuAD 2.0Jiawei Li, Yue Zhang · amazon-science, stanford
Machine reading comprehension is an essential natural language processing task, which takes into a pair of context and query and predicts the corresponding answer to query. In this project, we developed an end-to-end question answering model incorporating BERT and additional linguistic features. We conclude that the BERT base model will be improved by incorporating the features. The EM score and F1 score are improved 2.17 and 2.14 compared with BERT(base). Our best single model reaches EM score 76.55 and F1 score 79.97 in the hidden test set. Our error analysis also shows that the linguistic architecture can help model understand the context better in that it can locate answers that BERT only model predicted "No Answer" wrongly.
13.3SEAug 10, 2021
PyNose: A Test Smell Detector For PythonTongjie Wang, Yaroslav Golubev, Oleg Smirnov et al.
Similarly to production code, code smells also occur in test code, where they are called test smells. Test smells have a detrimental effect not only on test code but also on the production code that is being tested. To date, the majority of the research on test smells has been focusing on programming languages such as Java and Scala. However, there are no available automated tools to support the identification of test smells for Python, despite its rapid growth in popularity in recent years. In this paper, we strive to extend the research to Python, build a tool for detecting test smells in this language, and conduct an empirical analysis of test smells in Python projects. We started by gathering a list of test smells from existing research and selecting test smells that can be considered language-agnostic or have similar functionality in Python's standard Unittest framework. In total, we identified 17 diverse test smells. Additionally, we searched for Python-specific test smells by mining frequent code change patterns that can be considered as either fixing or introducing test smells. Based on these changes, we proposed our own test smell called Suboptimal assert. To detect all these test smells, we developed a tool called PyNose in the form of a plugin to PyCharm, a popular Python IDE. Finally, we conducted a large-scale empirical investigation aimed at analyzing the prevalence of test smells in Python code. Our results show that 98% of the projects and 84% of the test suites in the studied dataset contain at least one test smell. Our proposed Suboptimal assert smell was detected in as much as 70.6% of the projects, making it a valuable addition to the list.
8.6SEMay 21, 2021
Changes from the Trenches: Should We Automate Them?Yaroslav Golubev, Jiawei Li, Viacheslav Bushev et al.
Code changes constitute one of the most important features of software evolution. Studying them can provide insights into the nature of software development and also lead to practical solutions - recommendations and automations of popular changes for developers. In our work, we developed a tool called PythonChangeMiner that allows to discover code change patterns in the histories of Python projects. We validated the tool and then employed it to discover patterns in the dataset of 120 projects from four different domains of software engineering. We manually categorized patterns that occur in more than one project from the standpoint of their structure and content, and compared different domains and patterns in that regard. We conducted a survey of the authors of the discovered changes: 82.9% of them said that they can give the change a name and 57.9% expressed their desire to have the changes automated, indicating the ability of the tool to discover valuable patterns. Finally, we interviewed 9 members of a popular integrated development environment (IDE) development team to estimate the feasibility of automating the discovered changes. It was revealed that independence from the context and high precision made a pattern a better candidate for automation. The patterns received mainly positive reviews and several were ranked as very likely for automation.