SEAug 15, 2023
From Commit Message Generation to History-Aware Commit Message CompletionAleksandra Eliseeva, Yaroslav Sokolov, Egor Bogomolov et al.
Commit messages are crucial to software development, allowing developers to track changes and collaborate effectively. Despite their utility, most commit messages lack important information since writing high-quality commit messages is tedious and time-consuming. The active research on commit message generation (CMG) has not yet led to wide adoption in practice. We argue that if we could shift the focus from commit message generation to commit message completion and use previous commit history as additional context, we could significantly improve the quality and the personal nature of the resulting commit messages. In this paper, we propose and evaluate both of these novel ideas. Since the existing datasets lack historical data, we collect and share a novel dataset called CommitChronicle, containing 10.7M commits across 20 programming languages. We use this dataset to evaluate the completion setting and the usefulness of the historical context for state-of-the-art CMG models and GPT-3.5-turbo. Our results show that in some contexts, commit message completion shows better results than generation, and that while in general GPT-3.5-turbo performs worse, it shows potential for long and detailed messages. As for the history, the results show that historical information improves the performance of CMG models in the generation task, and the performance of GPT-3.5-turbo in both generation and completion.
SEMay 21, 2022
All You Need Is Logs: Improving Code Completion by Learning from Anonymous IDE Usage LogsVitaliy Bibaev, Alexey Kalina, Vadim Lomshakov et al.
In this work, we propose an approach for collecting completion usage logs from the users in an IDE and using them to train a machine learning based model for ranking completion candidates. We developed a set of features that describe completion candidates and their context, and deployed their anonymized collection in the Early Access Program of IntelliJ-based IDEs. We used the logs to collect a dataset of code completions from users, and employed it to train a ranking CatBoost model. Then, we evaluated it in two settings: on a held-out set of the collected completions and in a separate A/B test on two different groups of users in the IDE. Our evaluation shows that using a simple ranking model trained on the past user behavior logs significantly improved code completion experience. Compared to the default heuristics-based ranking, our model demonstrated a decrease in the number of typing actions necessary to perform the completion in the IDE from 2.073 to 1.832. The approach adheres to privacy requirements and legal constraints, since it does not require collecting personal information, performing all the necessary anonymization on the client's side. Importantly, it can be improved continuously: implementing new features, collecting new data, and evaluating new models - this way, we have been using it in production since the end of 2020.
SENov 9, 2025
Understanding Student Interaction with AI-Powered Next-Step Hints: Strategies and ChallengesAnastasiia Birillo, Aleksei Rostovskii, Yaroslav Golubev et al.
Automated feedback generation plays a crucial role in enhancing personalized learning experiences in computer science education. Among different types of feedback, next-step hint feedback is particularly important, as it provides students with actionable steps to progress towards solving programming tasks. This study investigates how students interact with an AI-driven next-step hint system in an in-IDE learning environment. We gathered and analyzed a dataset from 34 students solving Kotlin tasks, containing detailed hint interaction logs. We applied process mining techniques and identified 16 common interaction scenarios. Semi-structured interviews with 6 students revealed strategies for managing unhelpful hints, such as adapting partial hints or modifying code to generate variations of the same hint. These findings, combined with our publicly available dataset, offer valuable opportunities for future research and provide key insights into student behavior, helping improve hint design for enhanced learning support.
SEDec 3, 2024Code
Drawing Pandas: A Benchmark for LLMs in Generating Plotting CodeTimur Galimzyanov, Sergey Titov, Yaroslav Golubev et al.
This paper introduces the human-curated PandasPlotBench dataset, designed to evaluate language models' effectiveness as assistants in visual data exploration. Our benchmark focuses on generating code for visualizing tabular data - such as a Pandas DataFrame - based on natural language instructions, complementing current evaluation tools and expanding their scope. The dataset includes 175 unique tasks. Our experiments assess several leading Large Language Models (LLMs) across three visualization libraries: Matplotlib, Seaborn, and Plotly. We show that the shortening of tasks has a minimal effect on plotting capabilities, allowing for the user interface that accommodates concise user input without sacrificing functionality or accuracy. Another of our findings reveals that while LLMs perform well with popular libraries like Matplotlib and Seaborn, challenges persist with Plotly, highlighting areas for improvement. We hope that the modular design of our benchmark will broaden the current studies on generating visualizations. Our dataset and benchmark code are available online: https://huggingface.co/datasets/JetBrains-Research/PandasPlotBench; https://github.com/JetBrains-Research/PandasPlotBench.
SEOct 11, 2024Code
One Step at a Time: Combining LLMs and Static Analysis to Generate Next-Step Hints for Programming TasksAnastasiia Birillo, Elizaveta Artser, Anna Potriasaeva et al.
Students often struggle with solving programming problems when learning to code, especially when they have to do it online, with one of the most common disadvantages of working online being the lack of personalized help. This help can be provided as next-step hint generation, i.e., showing a student what specific small step they need to do next to get to the correct solution. There are many ways to generate such hints, with large language models (LLMs) being among the most actively studied right now. While LLMs constitute a promising technology for providing personalized help, combining them with other techniques, such as static analysis, can significantly improve the output quality. In this work, we utilize this idea and propose a novel system to provide both textual and code hints for programming tasks. The pipeline of the proposed approach uses a chain-of-thought prompting technique and consists of three distinct steps: (1) generating subgoals - a list of actions to proceed with the task from the current student's solution, (2) generating the code to achieve the next subgoal, and (3) generating the text to describe this needed action. During the second step, we apply static analysis to the generated code to control its size and quality. The tool is implemented as a modification to the open-source JetBrains Academy plugin, supporting students in their in-IDE courses. To evaluate our approach, we propose a list of criteria for all steps in our pipeline and conduct two rounds of expert validation. Finally, we evaluate the next-step hints in a classroom with 14 students from two universities. Our results show that both forms of the hints - textual and code - were helpful for the students, and the proposed system helped them to proceed with the coding tasks.
SEMay 14, 2024Code
Full Line Code Completion: Bringing AI to DesktopAnton Semenkin, Vitaliy Bibaev, Yaroslav Sokolov et al.
In recent years, several industrial solutions for the problem of multi-token code completion appeared, each making a great advance in the area but mostly focusing on cloud-based runtime and avoiding working on the end user's device. In this work, we describe our approach for building a multi-token code completion feature for the JetBrains' IntelliJ Platform, which we call Full Line Code Completion. The feature suggests only syntactically correct code and works fully locally, i.e., data querying and the generation of suggestions happens on the end user's machine. We share important time and memory-consumption restrictions, as well as design principles that a code completion engine should satisfy. Working entirely on the end user's device, our code completion engine enriches user experience while being not only fast and compact but also secure. We share a number of useful techniques to meet the stated development constraints and also describe offline and online evaluation pipelines that allowed us to make better decisions. Our online evaluation shows that the usage of the tool leads to 1.3 times more Python code in the IDE being produced by code completion. The described solution was initially started with a help of researchers and was then bundled into all JetBrains IDEs where it is now used by millions of users. Thus, we believe that this work is useful for bridging academia and industry, providing researchers with the knowledge of what happens when complex research-based solutions are integrated into real products.
49.7SEMar 30
Evolving with AI: A Longitudinal Analysis of Developer LogsAgnia Sergeyuk, Eric Huang, Dariia Karaeva et al.
AI-powered coding assistants are rapidly becoming fixtures in professional IDEs, yet their sustained influence on everyday development remains poorly understood. Prior research has focused on short-term use or self-reported perceptions, leaving open questions about how sustained AI use reshapes actual daily coding practices in the long term. We address this gap with a mixed-method study of AI adoption in IDEs, combining longitudinal two-year fine-grained telemetry from 800 developers with a survey of 62 professionals. We analyze five dimensions of workflow change: productivity, code quality, code editing, code reuse, and context switching. Telemetry reveals that AI users produce substantially more code but also delete significantly more. Meanwhile, survey respondents report productivity gains and perceive minimal changes in other dimensions. Our results offer empirical insights into the silent restructuring of software workflows and provide implications for designing future AI-augmented tooling.
SEDec 19, 2024Code
Stack Trace Deduplication: Faster, More Accurately, and in More Realistic ScenariosEgor Shibaev, Denis Sushentsev, Yaroslav Golubev et al.
In large-scale software systems, there are often no fully-fledged bug reports with human-written descriptions when an error occurs. In this case, developers rely on stack traces, i.e., series of function calls that led to the error. Since there can be tens and hundreds of thousands of them describing the same issue from different users, automatic deduplication into categories is necessary to allow for processing. Recent works have proposed powerful deep learning-based approaches for this, but they are evaluated and compared in isolation from real-life workflows, and it is not clear whether they will actually work well at scale. To overcome this gap, this work presents three main contributions: a novel model, an industry-based dataset, and a multi-faceted evaluation. Our model consists of two parts - (1) an embedding model with byte-pair encoding and approximate nearest neighbor search to quickly find the most relevant stack traces to the incoming one, and (2) a reranker that re-ranks the most fitting stack traces, taking into account the repeated frames between them. To complement the existing datasets collected from open-source projects, we share with the community SlowOps - a dataset of stack traces from IntelliJ-based products developed by JetBrains, which has an order of magnitude more stack traces per category. Finally, we carry out an evaluation that strives to be realistic: measuring not only the accuracy of categorization, but also the operation time and the ability to create new categories. The evaluation shows that our model strikes a good balance - it outperforms other models on both open-source datasets and SlowOps, while also being faster on time than most. We release all of our code and data, and hope that our work can pave the way to further practice-oriented research in the area.
SEOct 14, 2025Code
Diff-XYZ: A Benchmark for Evaluating Diff UnderstandingEvgeniy Glukhov, Michele Conti, Egor Bogomolov et al.
Reliable handling of code diffs is central to agents that edit and refactor repositories at scale. We introduce Diff-XYZ, a compact benchmark for code-diff understanding with three supervised tasks: apply (old code $+$ diff $\rightarrow$ new code), anti-apply (new code $-$ diff $\rightarrow$ old code), and diff generation (new code $-$ old code $\rightarrow$ diff). Instances in the benchmark are triples $\langle \textit{old code}, \textit{new code}, \textit{diff} \rangle$ drawn from real commits in CommitPackFT, paired with automatic metrics and a clear evaluation protocol. We use the benchmark to do a focused empirical study of the unified diff format and run a cross-format comparison of different diff representations. Our findings reveal that different formats should be used depending on the use case and model size. For example, representing diffs in search-replace format performs best for larger models across most tasks, while structured udiff variants offer similar but slightly weaker performance. In contrast, smaller open models benefit little from any formatting choice. The Diff-XYZ benchmark is a reusable foundation for assessing and improving diff handling in LLMs that can aid future development of diff formats and models editing code. The dataset is published on HuggingFace Hub: https://huggingface.co/datasets/JetBrains-Research/diff-xyz.
SEOct 5, 2025Code
Challenge on Optimization of Context Collection for Code CompletionDmitry Ustalov, Egor Bogomolov, Alexander Bezzubov et al.
The rapid advancement of workflows and methods for software engineering using AI emphasizes the need for a systematic evaluation and analysis of their ability to leverage information from entire projects, particularly in large code bases. In this challenge on optimization of context collection for code completion, organized by JetBrains in collaboration with Mistral AI as part of the ASE 2025 conference, participants developed efficient mechanisms for collecting context from source code repositories to improve fill-in-the-middle code completions for Python and Kotlin. We constructed a large dataset of real-world code in these two programming languages using permissively licensed open-source projects. The submissions were evaluated based on their ability to maximize completion quality for multiple state-of-the-art neural models using the chrF metric. During the public phase of the competition, nineteen teams submitted solutions to the Python track and eight teams submitted solutions to the Kotlin track. In the private phase, six teams competed, of which five submitted papers to the workshop.
LGJun 17, 2024Code
Long Code Arena: a Set of Benchmarks for Long-Context Code ModelsEgor Bogomolov, Aleksandra Eliseeva, Timur Galimzyanov et al.
Nowadays, the fields of code and natural language processing are evolving rapidly. In particular, models become better at processing long context windows - supported context sizes have increased by orders of magnitude over the last few years. However, there is a shortage of benchmarks for code processing that go beyond a single file of context, while the most popular ones are limited to a single method. With this work, we aim to close this gap by introducing Long Code Arena, a suite of six benchmarks for code processing tasks that require project-wide context. These tasks cover different aspects of code processing: library-based code generation, CI builds repair, project-level code completion, commit message generation, bug localization, and module summarization. For each task, we provide a manually verified dataset for testing, an evaluation suite, and open-source baseline solutions based on popular LLMs to showcase the usage of the dataset and to simplify adoption by other researchers. We publish the benchmark page on HuggingFace Spaces with the leaderboard, links to HuggingFace Hub for all the datasets, and link to the GitHub repository with baselines: https://huggingface.co/spaces/JetBrains-Research/long-code-arena.
SEDec 30, 2021Code
AntiCopyPaster: Extracting Code Duplicates As Soon As They Are Introduced in the IDEEman Abdullah AlOmar, Anton Ivanov, Zarina Kurbatova et al.
We developed a plugin for IntelliJ IDEA called AntiCopyPaster, which tracks the pasting of code fragments inside the IDE and suggests the appropriate Extract Method refactoring to combat the propagation of duplicates. Unlike the existing approaches, our tool is integrated with the developer's workflow, and pro-actively recommends refactorings. Since not all code fragments need to be extracted, we develop a classification model to make this decision. When a developer copies and pastes a code fragment, the plugin searches for duplicates in the currently opened file, waits for a short period of time to allow the developer to edit the code, and finally inferences the refactoring decision based on a number of features. Our experimental study on a large dataset of 18,942 code fragments mined from 13 Apache projects shows that AntiCopyPaster correctly recommends Extract Method refactorings with an F-score of 0.82. Furthermore, our survey of 59 developers reflects their satisfaction with the developed plugin's operation. The plugin and its source code are publicly available on GitHub at https://github.com/JetBrains-Research/anti-copy-paster. The demonstration video can be found on YouTube: https://youtu.be/_wwHg-qFjJY.
SEAug 25, 2021Code
Revizor: A Data-Driven Approach to Automate Frequent Code Changes Based on Graph MatchingOleg Smirnov, Artyom Lobanov, Yaroslav Golubev et al.
Many code changes that developers make in their projects are repeated and constitute recurrent change patterns. It is of interest to collect such patterns from the version history of open-source repositories and suggest the most useful of them as quick fixes. In this paper, we present Revizor - a tool aimed to build custom plugins for PyCharm, a popular Python IDE. A Revizor-based plugin can take change patterns and highlight potential places for their application in the developer's code editor. If the developer accepts the quick fix, the plugin automatically performs the edit. Our approach uses a graph-based representation of code changes, which allows it to support complex distributed code patterns. Experienced developers have also rated the usability and the performance of such Revizor-based plugin positively. The source code of the tool and test plugin prototype are available on GitHub: https://github.com/JetBrains-Research/revizor. A demonstration video with a short tool description can be found on YouTube: https://youtu.be/5eLs14nco7E.
SEJul 28, 2021Code
Sorrel: an IDE Plugin for Managing Licenses and Detecting License IncompatibilitiesDmitry Pogrebnoy, Ivan Kuznetsov, Yaroslav Golubev et al.
Software development is a complex process that includes many different tasks besides just writing code. One of the aspects of software engineering is selecting and managing licenses for the given project. In this paper, we present Sorrel - a plugin for managing licenses and detecting potential incompatibilities for IntelliJ IDEA, a popular Java IDE. The plugin scans the project in search of information about the project license and the licenses of its libraries. If the project does not yet have a license, the plugin provides the developer with recommendations for choosing the most suitable open license, and if there is a license, it informs the programmer about potential licensing violations. The tool makes it easier for developers to choose a proper license for a project and avoid most of the licensing errors - all inside the familiar IDE editor. The plugin and its source code are available online on GitHub: https://github.com/JetBrains-Research/sorrel. A demonstration video can be found at https://youtu.be/doUeAwPjcPE.
SEJul 9, 2021Code
On the Nature of Code Cloning in Open-Source Java ProjectsYaroslav Golubev, Timofey Bryksin
Code cloning plays a very important role in open-source software engineering. The presence of clones within a project may indicate a need for refactoring, and clones between projects are even more interesting, since code migration takes place and violations are possible. But how is code being copied? How prevalent is the process and on what level does it happen? In this general study, we attempt to shed some light on these questions by searching for clones in a large dataset of over 23 thousand Java projects on the level of both files and methods, and by studying the code fragments themselves and their clone pairs. We study the size and the age of code fragments, the prevalence of their clones, relationships between exact and non-exact clones, as well as between method-level and file-level clones. We also discover and describe various anomalies in the code clones that were detected in the dataset. Our research shows that the copying occurs all through the years of the Java code existence and that method-level copying is much more prevalent than file-level copying, with only 35.4% of methods having no clones at all. Additionally, some of the discovered anomalies can be useful for future large-scale cloning research as they can be used for removing auto-generated code.
SEJul 6, 2020Code
Sosed: a tool for finding similar software projectsEgor Bogomolov, Yaroslav Golubev, Artyom Lobanov et al.
In this paper, we present Sosed, a tool for discovering similar software projects. We use fastText to compute the embeddings of subtokens into a dense space for 120,000 GitHub repositories in 200 languages. Then, we cluster embeddings to identify groups of semantically similar sub-tokens that reflect topics in source code. We use a dataset of 9 million GitHub projects as a reference search base. To identify similar projects, we compare the distributions of clusters among their sub-tokens. The tool receives an arbitrary project as input, extracts sub-tokens in 16 most popular programming languages, computes cluster distribution, and finds projects with the closest distribution in the search base. We labeled subtoken clusters with short descriptions to enable Sosed to produce interpretable output. Sosed is available at https://github.com/JetBrains-Research/sosed/. The tool demo is available at https://www.youtube.com/watch?v=LYLkztCGRt8. The multi-language extractor of sub-tokens is available separately at https://github.com/JetBrains-Research/buckwheat/.
SEFeb 15, 2020Code
Recommendation of Move Method Refactoring Using Path-Based Representation of CodeZarina Kurbatova, Ivan Veselov, Yaroslav Golubev et al.
Software refactoring plays an important role in increasing code quality. One of the most popular refactoring types is the Move Method refactoring. It is usually applied when a method depends more on members of other classes than on its own original class. Several approaches have been proposed to recommend Move Method refactoring automatically. Most of them are based on heuristics and have certain limitations (e.g., they depend on the selection of metrics and manually-defined thresholds). In this paper, we propose an approach to recommend Move Method refactoring based on a path-based representation of code called code2vec that is able to capture the syntactic structure and semantic information of a code fragment. We use this code representation to train a machine learning classifier suggesting to move methods to more appropriate classes. We evaluate the approach on two publicly available datasets: a manually compiled dataset of well-known open-source projects and a synthetic dataset with automatically injected code smell instances. The results show that our approach is capable of recommending accurate refactoring opportunities and outperforms JDeodorant and JMove, which are state of the art tools in this field.
SEFeb 12, 2020Code
A Study of Potential Code Borrowing and License Violations in Java Projects on GitHubYaroslav Golubev, Maria Eliseeva, Nikita Povarov et al.
With an ever-increasing amount of open source software, the popularity of services like GitHub that facilitate code reuse, and common misconceptions about the licensing of open source software, the problem of license violations in the code is getting more and more prominent. In this study, we compile an extensive corpus of popular Java projects from GitHub, search it for code clones and perform an original analysis of possible code borrowing and license violations on the level of code fragments. We chose Java as a language because of its popularity in industry, where the plagiarism problem is especially relevant because of possible legal action. We analyze and discuss distribution of 94 different discovered and manually evaluated licenses in files and projects, differences in the licensing of files, distribution of potential code borrowing between licenses, various types of possible license violations, most violated licenses, etc. Studying possible license violations in specific blocks of code, we have discovered that 29.6% of them might be involved in potential code borrowing and 9.4% of them could potentially violate original licenses.
SEOct 7, 2025
Mellum: Production-Grade in-IDE Contextual Code Completion with Multi-File Project UnderstandingNikita Pavlichenko, Iurii Nazarov, Ivan Dolgov et al.
We present the Mellum models family, open-weight code completion models designed for interactive use in JetBrains IDEs. Mellums have 4B parameters, adopt a Llama-style architecture, and are pre-trained on ~4T tokens of permissively licensed, multi-language code. Our studies show that (i) careful data curation and staged training significantly improve the model's quality, (ii) editor-critical capabilities such as context packing are necessary for high-quality suggestions, and (iii) a compact, task-focused model can meet the cost and latency constraints of interactive completion. In the paper, we describe an end-to-end industrial pipeline for producing contextualized in-editor completion: disciplined data governance, multi-stage training that includes fill-in-the-middle and project context via supervised fine-tuning, and alignment via direct preference optimization using feedback from real-world scenarios. Our quality evaluations include both large-scale offline benchmarks and online telemetry from production deployments in JetBrains IDEs. Mellums are released under the Apache-2.0 license on HuggingFace, with a public model card providing a reproducible reference for practitioners. Our experience offers a pragmatic blueprint for taking a focused, open model from a research prototype to at scale production for hundreds of thousands of users.
SEOct 15, 2024
Towards Realistic Evaluation of Commit Message Generation by Matching Online and Offline SettingsPetr Tsvetkov, Aleksandra Eliseeva, Danny Dig et al.
When a Commit Message Generation (CMG) system is integrated into the IDEs and other products at JetBrains, we perform online evaluation based on user acceptance of the generated messages. However, performing online experiments with every change to a CMG system is troublesome, as each iteration affects users and requires time to collect enough statistics. On the other hand, offline evaluation, a prevalent approach in the research literature, facilitates fast experiments but employs automatic metrics that are not guaranteed to represent the preferences of real users. In this work, we describe a novel way we employed to deal with this problem at JetBrains, by leveraging an online metric - the number of edits users introduce before committing the generated messages to the VCS - to select metrics for offline experiments. To support this new type of evaluation, we develop a novel markup collection tool mimicking the real workflow with a CMG system, collect a dataset with 57 pairs consisting of commit messages generated by GPT-4 and their counterparts edited by human experts, and design and verify a way to synthetically extend such a dataset. Then, we use the final dataset of 656 pairs to study how the widely used similarity metrics correlate with the online metric reflecting the real users' experience. Our results indicate that edit distance exhibits the highest correlation with the online metric, whereas commonly used similarity metrics such as BLEU and METEOR demonstrate low correlation. This contradicts the previous studies on similarity metrics for CMG, suggesting that user interactions with a CMG system in real-world settings differ significantly from the responses by human labelers within controlled environments. We release all the code and the dataset to support future research in the field: https://jb.gg/cmg-evaluation.
SEJun 11, 2024
Using AI-Based Coding Assistants in Practice: State of Affairs, Perceptions, and Ways ForwardAgnia Sergeyuk, Yaroslav Golubev, Timofey Bryksin et al.
Context. The last several years saw the emergence of AI assistants for code - multi-purpose AI-based helpers in software engineering. As they become omnipresent in all aspects of software development, it becomes critical to understand their usage patterns. Objective. We aim to better understand how specifically developers are using AI assistants, why they are not using them in certain parts of their development workflow, and what needs to be improved in the future. Methods. In this work, we carried out a large-scale survey aimed at how AI assistants are used, focusing on specific software development activities and stages. We collected opinions of 481 programmers on five broad activities: (a) implementing new features, (b) writing tests, (c) bug triaging, (d) refactoring, and (e) writing natural-language artifacts, as well as their individual stages. Results. Our results provide a novel comparison of different stages where AI assistants are used that is both comprehensive and detailed. It highlights specific activities that developers find less enjoyable and want to delegate to an AI assistant, e.g., writing tests and natural-language artifacts. We also determine more granular stages where AI assistants are used, such as generating tests and generating docstrings, as well as less studied parts of the workflow, such as generating test data. Among the reasons for not using assistants, there are general aspects like trust and company policies, as well as more concrete issues like the lack of project-size context, which can be the focus of the future research. Conclusion. The provided analysis highlights stages of software development that developers want to delegate and that are already popular for using AI assistants, which can be a good focus for features aimed to help developers right now. The main reasons for not using AI assistants can serve as a guideline for future work.
SEJan 14, 2022
DapStep: Deep Assignee Prediction for Stack Trace Error rePresentationDenis Sushentsev, Aleksandr Khvorov, Roman Vasiliev et al.
The task of finding the best developer to fix a bug is called bug triage. Most of the existing approaches consider the bug triage task as a classification problem, however, classification is not appropriate when the sets of classes change over time (as developers often do in a project). Furthermore, to the best of our knowledge, all the existing models use textual sources of information, i.e., bug descriptions, which are not always available. In this work, we explore the applicability of existing solutions for the bug triage problem when stack traces are used as the main data source of bug reports. Additionally, we reformulate this task as a ranking problem and propose new deep learning models to solve it. The models are based on a bidirectional recurrent neural network with attention and on a convolutional neural network, with the weights of the models optimized using a ranking loss function. To improve the quality of ranking, we propose using additional information from version control system annotations. Two approaches are proposed for extracting features from annotations: manual and using an additional neural network. To evaluate our models, we collected two datasets of real-world stack traces. Our experiments show that the proposed models outperform existing models adapted to handle stack traces. To facilitate further research in this area, we publish the source code of our models and one of the collected datasets.
SEDec 29, 2021
ReSplit: Improving the Structure of Jupyter Notebooks by Re-Splitting Their CellsSergey Titov, Yaroslav Golubev, Timofey Bryksin
Jupyter notebooks represent a unique format for programming - a combination of code and Markdown with rich formatting, separated into individual cells. We propose to perceive a Jupyter Notebook cell as a simplified and raw version of a programming function. Similar to functions, Jupyter cells should strive to contain singular, self-contained actions. At the same time, research shows that real-world notebooks fail to do so and suffer from the lack of proper structure. To combat this, we propose ReSplit, an algorithm for an automatic re-splitting of cells in Jupyter notebooks. The algorithm analyzes definition-usage chains in the notebook and consists of two parts - merging and splitting the cells. We ran the algorithm on a large corpus of notebooks to evaluate its performance and its overall effect on notebooks, and evaluated it by human experts: we showed them several notebooks in their original and the re-split form. In 29.5% of cases, the re-split notebook was selected as the preferred way of perceiving the code. We analyze what influenced this decision and describe several individual cases in detail.
SEOct 1, 2021
The IntelliJ Platform: a Framework for Building Plugins and Mining Software DataZarina Kurbatova, Yaroslav Golubev, Vladimir Kovalenko et al.
In software engineering, a great number of new approaches are being actively researched, and a lot of tools are being developed based on them. These tools require a framework for their creation and an opportunity to be used by potential developers. Modern IDEs provide both. In this paper, we describe the main capabilities of the IntelliJ Platform that could be useful for researchers that are developing code analysis tools. To illustrate the benefits of using the platform, we describe several use cases that researchers might be interested in: mining software data, running machine learning models on code, recommending refactorings, and visualizing data in the IDE. We provide several examples of existing plugins that implement these cases. Finally, to make it easier to start working with the platform, we develop and provide simple plugins for each use case that could serve as a template for a new project.
SEAug 17, 2021
Infrastructure in Code: Towards Developer-Friendly Cloud ApplicationsVladislav Tankov, Dmitriy Valchuk, Yaroslav Golubev et al.
The popularity of cloud technologies has led to the development of a new type of applications that specifically target cloud environments. Such applications require a lot of cloud infrastructure to run, which brought about the Infrastructure as Code approach, where the infrastructure is also coded using a separate language in parallel to the main application. In this paper, we propose a new concept of Infrastructure in Code, where the infrastructure is deduced from the application code itself, without the need for separate specifications. We describe this concept, discuss existing solutions that can be classified as Infrastructure in Code and their limitations, and then present our own framework called Kotless - an extendable cloud-agnostic serverless framework for Kotlin that supports two cloud providers, three DSLs, and two runtimes. Finally, we showcase the usefulness of Kotless by demonstrating its efficiency in migrating an existing application to a serverless environment.
SEAug 10, 2021
PyNose: A Test Smell Detector For PythonTongjie Wang, Yaroslav Golubev, Oleg Smirnov et al.
Similarly to production code, code smells also occur in test code, where they are called test smells. Test smells have a detrimental effect not only on test code but also on the production code that is being tested. To date, the majority of the research on test smells has been focusing on programming languages such as Java and Scala. However, there are no available automated tools to support the identification of test smells for Python, despite its rapid growth in popularity in recent years. In this paper, we strive to extend the research to Python, build a tool for detecting test smells in this language, and conduct an empirical analysis of test smells in Python projects. We started by gathering a list of test smells from existing research and selecting test smells that can be considered language-agnostic or have similar functionality in Python's standard Unittest framework. In total, we identified 17 diverse test smells. Additionally, we searched for Python-specific test smells by mining frequent code change patterns that can be considered as either fixing or introducing test smells. Based on these changes, we proposed our own test smell called Suboptimal assert. To detect all these test smells, we developed a tool called PyNose in the form of a plugin to PyCharm, a popular Python IDE. Finally, we conducted a large-scale empirical investigation aimed at analyzing the prevalence of test smells in Python code. Our results show that 98% of the projects and 84% of the test suites in the studied dataset contain at least one test smell. Our proposed Suboptimal assert smell was detected in as much as 70.6% of the projects, making it a valuable addition to the list.
SEJul 15, 2021
One Thousand and One Stories: A Large-Scale Survey of Software RefactoringYaroslav Golubev, Zarina Kurbatova, Eman Abdullah AlOmar et al.
Despite the availability of refactoring as a feature in popular IDEs, recent studies revealed that developers are reluctant to use them, and still prefer the manual refactoring of their code. At JetBrains, our goal is to fully support refactoring features in IntelliJ-based IDEs and improve their adoption in practice. Therefore, we start by raising the following main questions. How exactly do people refactor code? What refactorings are the most popular? Why do some developers tend not to use convenient IDE refactoring tools? In this paper, we investigate the raised questions through the design and implementation of a survey targeting 1,183 users of IntelliJ-based IDEs. Our quantitative and qualitative analysis of the survey results shows that almost two-thirds of developers spend more than one hour in a single session refactoring their code; that refactoring types vary greatly in popularity; and that a lot of developers would like to know more about IDE refactoring features but lack the means to do so. These results serve us internally to support the next generation of refactoring features, as well as can help our research community to establish new directions in the refactoring usability research.
SEJun 3, 2021
Unsupervised Learning of General-Purpose Embeddings for Code ChangesMikhail Pravilov, Egor Bogomolov, Yaroslav Golubev et al.
Applying machine learning to tasks that operate with code changes requires their numerical representation. In this work, we propose an approach for obtaining such representations during pre-training and evaluate them on two different downstream tasks - applying changes to code and commit message generation. During pre-training, the model learns to apply the given code change in a correct way. This task requires only code changes themselves, which makes it unsupervised. In the task of applying code changes, our model outperforms baseline models by 5.9 percentage points in accuracy. As for the commit message generation, our model demonstrated the same results as supervised models trained for this specific task, which indicates that it can encode code changes well and can be improved in the future by pre-training on a larger dataset of easily gathered code changes.
SEMay 21, 2021
Changes from the Trenches: Should We Automate Them?Yaroslav Golubev, Jiawei Li, Viacheslav Bushev et al.
Code changes constitute one of the most important features of software evolution. Studying them can provide insights into the nature of software development and also lead to practical solutions - recommendations and automations of popular changes for developers. In our work, we developed a tool called PythonChangeMiner that allows to discover code change patterns in the histories of Python projects. We validated the tool and then employed it to discover patterns in the dataset of 120 projects from four different domains of software engineering. We manually categorized patterns that occur in more than one project from the standpoint of their structure and content, and compared different domains and patterns in that regard. We conducted a survey of the authors of the discovered changes: 82.9% of them said that they can give the change a name and 57.9% expressed their desire to have the changes automated, indicating the ability of the tool to discover valuable patterns. Finally, we interviewed 9 members of a popular integrated development environment (IDE) development team to estimate the feasibility of automating the discovered changes. It was revealed that independence from the context and high precision made a pattern a better candidate for automation. The patterns received mainly positive reviews and several were ranked as very likely for automation.
SEFeb 12, 2020
Multi-threshold token-based code clone detectionYaroslav Golubev, Viktor Poletansky, Nikita Povarov et al.
Clone detection plays an important role in software engineering. Finding clones within a single project introduces possible refactoring opportunities, and between different projects it could be used for detecting code reuse or possible licensing violations. In this paper, we propose a modification to bag-of-tokens based clone detection that allows detecting more clone pairs of greater diversity without losing precision by implementing a multi-threshold search, i.e. conducting the search several times, aimed at different groups of clones. To combat the increase in operation time that this approach brings about, we propose an optimization that allows to significantly decrease the overlap in detected clones between the searches. We evaluate the method for two different popular clone detection tools on two datasets of different sizes. The implementation of the technique allows to increase the number of detected clones by 40.5-56.6% for different datasets. BigCloneBench evaluation also shows that the recall of detecting Strongly Type-3 clones increases from 37.5% to 59.6%.