SE | Aug 5, 2022
Out of the BLEU: how should we assess quality of the Code Generation models?
Mikhail Evtikhiev, Egor Bogomolov, Yaroslav Sokolov et al.
In recent years, researchers have created and introduced a significant number of various code generation models. As human evaluation of every new model version is unfeasible, the community adopted automatic evaluation metrics such as BLEU to approximate the results of human judgement. These metrics originate from the machine translation domain and it is unclear whether they are applicable for the code generation tasks and how well they agree with the human evaluation on this task. There are also other metrics, CodeBLEU and RUBY, developed to estimate the similarity of code, that take into account the properties of source code. However, for these metrics there are hardly any studies on their agreement with the human evaluation. Despite all that, minimal differences in the metric scores have been used in recent papers to claim superiority of some code generation models over the others. In this paper, we present a study on the applicability of six metrics -- BLEU, ROUGE-L, METEOR, ChrF, CodeBLEU, and RUBY -- for evaluation of code generation models. We conduct a study on two different code generation datasets and use human annotators to assess the quality of all models run on these datasets. The results indicate that for the CoNaLa dataset of Python one-liners, none of the metrics can correctly emulate human judgement on which model is better with >95% certainty if the difference in model scores is less than 5 points. For the HearthStone dataset, which consists of classes of a particular structure, a difference in model scores of at least 2 points is enough to claim the superiority of one model over the other. Our findings suggest that the ChrF metric is a better fit for the evaluation of code generation models than the commonly used BLEU and CodeBLEU. Yet, finding a metric for code generation that closely agrees with humans requires additional work.
SE | Mar 6, 2023
Judging Adam: Studying the Performance of Optimization Methods on ML4SE Tasks
Dmitry Pasechnyuk, Anton Prazdnichnykh, Mikhail Evtikhiev et al.
Solving a problem with a deep learning model requires researchers to optimize the loss function with a certain optimization method. The research community has developed more than a hundred different optimizers, yet there is scarce data on optimizer performance on various tasks. In particular, none of the benchmarks test the performance of optimizers on source code-related problems. However, existing benchmark data indicates that certain optimizers may be more efficient for particular domains. In this work, we test the performance of various optimizers on deep learning models for source code and find that the choice of an optimizer can have a significant impact on the model quality, with up to two-fold score differences between some of the relatively well-performing optimizers. We also find that the RAdam optimizer (and its modification with the Lookahead envelope) is the best optimizer that almost always performs well on the tasks we consider. Our findings show a need for a more extensive study of the optimizers in code-related tasks, and indicate that the ML4SE community should consider using RAdam instead of Adam as the default optimizer for code-related deep learning tasks.
SE | Mar 17, 2021
TNM: A Tool for Mining of Socio-Technical Data from Git Repositories
Nikolai Sviridov, Mikhail Evtikhiev, Vladimir Kovalenko
Networks of collaboration between engineers are reflected in traces of developers' activity in version control systems (VCSs). Extracting data from Git repositories is an essential task for researchers and practitioners working on socio-technical analysis, but it requires substantial engineering work. With increasing interest in analysing socio-technical data and applying it in practice, there are no flexible and easily reusable tools to retrieve socio-technical information from VCSs. With no common reusable toolkit existing for this task, the burden of mining diverts the focus of researchers from their core research questions. In this paper, we present TNM -- an open-source tool for mining socio-technical data from Git repositories. TNM is fast, flexible, and easily extensible. TNM is available on GitHub: https://github.com/JetBrains-Research/tnm
SE | Feb 3, 2022
Bus Factor In Practice
Elgun Jabrayilzade, Mikhail Evtikhiev, Eray Tüzün et al.
Bus factor is a metric that identifies how resilient a project is to sudden engineer turnover. It states the minimal number of engineers that have to be hit by a bus for a project to be stalled. Even though the metric is often discussed in the community, few studies consider its general relevance. Moreover, the existing tools for bus factor estimation focus solely on data from version control systems, even though there exist other channels for knowledge generation and distribution. With a survey of 269 engineers, we find that the bus factor is perceived as an important problem in collective development, and determine the highest-impact channels of knowledge generation and distribution in software development teams. We also propose a multimodal bus factor estimation algorithm that uses data on code reviews and meetings together with the VCS data. We test the algorithm on 13 projects developed at JetBrains and compare its results with those of the state-of-the-art tool by Avelino et al., using ground truth collected in a survey of the engineers working on these projects. Our algorithm is slightly better than the tool by Avelino et al. at predicting both the bus factor and the key developers. Finally, we use the interviews and the surveys to derive a set of best practices for addressing the bus factor issue and proposals for a possible bus factor assessment tool.
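To make the bus factor definition above concrete, here is a minimal sketch of one way the number could be estimated from version control data alone. It is not the multimodal algorithm proposed in the paper. It is a simplified greedy heuristic under stated assumptions: the `file_owners` mapping (file path to the set of developers who know that file), the function name `estimate_bus_factor`, and the rule that a project counts as stalled once more than half of its files have no knowledgeable developer left are all hypothetical choices made for illustration.

```python
from collections import Counter


def estimate_bus_factor(file_owners, stall_threshold=0.5):
    """Greedy bus factor estimate from a file -> knowledgeable-developers map.

    Assumption: the project is "stalled" once the share of files with no
    remaining knowledgeable developer exceeds `stall_threshold`.
    Returns (estimate, developers removed in order).
    """
    # Copy the input so the caller's data is not mutated.
    remaining = {path: set(devs) for path, devs in file_owners.items()}
    total = len(remaining)
    if total == 0:
        return 0, []

    removed = []
    while True:
        orphaned = sum(1 for devs in remaining.values() if not devs)
        if orphaned / total > stall_threshold:
            return len(removed), removed

        # Count how many files each remaining developer knows.
        coverage = Counter(dev for devs in remaining.values() for dev in devs)
        if not coverage:
            # Nobody is left to remove (only possible if stall_threshold >= 1).
            return len(removed), removed

        # Remove the developer covering the most files; ties break by name.
        top_dev = max(sorted(coverage), key=coverage.__getitem__)
        removed.append(top_dev)
        for devs in remaining.values():
            devs.discard(top_dev)


if __name__ == "__main__":
    example = {
        "core.py": {"alice"},
        "api.py": {"alice", "bob"},
        "ui.py": {"bob"},
        "docs.md": {"carol"},
    }
    print(estimate_bus_factor(example))
```

Because the removal order is greedy, the result is an estimate and not a guaranteed minimum. A multimodal approach like the one the abstract describes would presumably also draw on code review and meeting data when deciding who knows what, which this sketch does not attempt.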