Vladimir Kovalenko

SE
h-index: 14
13 papers
325 citations
Novelty: 24%
AI Score: 34

13 Papers

SE | Oct 5, 2025 | Code
Challenge on Optimization of Context Collection for Code Completion

Dmitry Ustalov, Egor Bogomolov, Alexander Bezzubov et al.

The rapid advancement of workflows and methods for software engineering using AI emphasizes the need for a systematic evaluation and analysis of their ability to leverage information from entire projects, particularly in large code bases. In this challenge on optimization of context collection for code completion, organized by JetBrains in collaboration with Mistral AI as part of the ASE 2025 conference, participants developed efficient mechanisms for collecting context from source code repositories to improve fill-in-the-middle code completions for Python and Kotlin. We constructed a large dataset of real-world code in these two programming languages using permissively licensed open-source projects. The submissions were evaluated based on their ability to maximize completion quality for multiple state-of-the-art neural models using the chrF metric. During the public phase of the competition, nineteen teams submitted solutions to the Python track and eight teams submitted solutions to the Kotlin track. In the private phase, six teams competed, of which five submitted papers to the workshop.
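To make the evaluation metric concrete, here is a minimal sketch of a chrF-style score (character n-gram F-score) in plain Python. It is a simplified illustration, not the challenge's evaluation code: the function names, the whitespace stripping, the n-gram range of 1 to 6, and beta = 2 are assumptions chosen to mirror common chrF defaults.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams after removing whitespace (an assumption of this sketch)."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified chrF: average n-gram precision and recall, then combine with F-beta."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            # Skip orders that are longer than one of the strings.
            continue
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * p * r / (b2 * p + r)


if __name__ == "__main__":
    # Compare a generated completion with the ground-truth line.
    print(chrf("return x", "return x + 1"))
```

With beta = 2, recall is weighted more heavily than precision, so a completion that omits part of the reference is penalized more than one that adds extra characters.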

SE | Aug 25, 2021 | Code
RefactorInsight: Enhancing IDE Representation of Changes in Git with Refactorings Information

Zarina Kurbatova, Vladimir Kovalenko, Ioana Savu et al.

Inspection of code changes is a time-consuming task that constitutes a large part of the everyday work of software engineers. Existing IDEs provide little information about the semantics of code changes within the file editor view. Therefore, developers have to track changes across multiple files, which is a hard task with large codebases. In this paper, we present RefactorInsight, a plugin for IntelliJ IDEA that introduces a smart diff for code changes in Java and Kotlin where refactorings are auto-folded and provided with their description, thus allowing users to focus on changes that modify the code behavior, such as bug fixes and new features. RefactorInsight supports three usage scenarios: viewing smart diffs with auto-folded refactorings and hints, inspecting refactorings in pull requests and in any specific commit in the project change history, and exploring the refactoring history of methods and classes. The evaluation shows that commit processing time is acceptable: the median is less than 0.2 seconds, a delay that does not disrupt developers' IDE workflows. RefactorInsight is available at https://github.com/JetBrains-Research/RefactorInsight. The demonstration video is available at https://youtu.be/-6L2AKQ66nA.

SE | Mar 17, 2021 | Code
TNM: A Tool for Mining of Socio-Technical Data from Git Repositories

Nikolai Sviridov, Mikhail Evtikhiev, Vladimir Kovalenko

Networks of collaboration between engineers are reflected in traces of developers' activity in version control systems (VCSs). Extracting data from Git repositories is an essential task for researchers and practitioners working on socio-technical analysis, but it requires substantial engineering work. Despite increasing interest in analysing socio-technical data and applying it in practice, there are no flexible and easily reusable tools to retrieve socio-technical information from VCSs. Without a common reusable toolkit for this task, the burden of mining diverts researchers' focus from their core research questions. In this paper, we present TNM, an open-source tool for mining socio-technical data from Git repositories. TNM is fast, flexible, and easily extensible. TNM is available on GitHub: https://github.com/JetBrains-Research/tnm

SE | Jul 6, 2020 | Code
Sosed: a tool for finding similar software projects

Egor Bogomolov, Yaroslav Golubev, Artyom Lobanov et al.

In this paper, we present Sosed, a tool for discovering similar software projects. We use fastText to compute the embeddings of sub-tokens into a dense space for 120,000 GitHub repositories in 200 languages. Then, we cluster embeddings to identify groups of semantically similar sub-tokens that reflect topics in source code. We use a dataset of 9 million GitHub projects as a reference search base. To identify similar projects, we compare the distributions of clusters among their sub-tokens. The tool receives an arbitrary project as input, extracts sub-tokens in the 16 most popular programming languages, computes the cluster distribution, and finds projects with the closest distribution in the search base. We labeled sub-token clusters with short descriptions to enable Sosed to produce interpretable output. Sosed is available at https://github.com/JetBrains-Research/sosed/. The tool demo is available at https://www.youtube.com/watch?v=LYLkztCGRt8. The multi-language extractor of sub-tokens is available separately at https://github.com/JetBrains-Research/buckwheat/.
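The comparison of cluster distributions described above can be sketched as follows. This is an illustrative toy, not Sosed's implementation: the sub-token-to-cluster mapping is hard-coded, all function names are hypothetical, and cosine similarity is an assumed choice of closeness measure, since the abstract does not name one.

```python
import math
from collections import Counter


def cluster_distribution(subtokens, cluster_of, n_clusters):
    """Normalized histogram of cluster ids over a project's sub-tokens."""
    counts = Counter(cluster_of[t] for t in subtokens if t in cluster_of)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * n_clusters
    return [counts.get(c, 0) / total for c in range(n_clusters)]


def cosine(u, v):
    """Cosine similarity between two equally sized vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def most_similar(query, search_base, k=1):
    """Names of the k projects whose distributions are closest to the query."""
    ranked = sorted(search_base.items(),
                    key=lambda item: cosine(query, item[1]),
                    reverse=True)
    return [name for name, _ in ranked[:k]]


if __name__ == "__main__":
    # Hypothetical clusters: 0 = networking, 1 = linear algebra.
    cluster_of = {"http": 0, "request": 0, "tensor": 1, "matrix": 1}
    base = {
        "web-server": cluster_distribution(["http", "request"], cluster_of, 2),
        "ml-lib": cluster_distribution(["tensor", "matrix", "http"], cluster_of, 2),
    }
    query = cluster_distribution(["http", "request", "request"], cluster_of, 2)
    print(most_similar(query, base))
```

Because each project is reduced to a fixed-length distribution, the search base can be compared against any input project regardless of its size or language mix.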

SE | Feb 3, 2022
Bus Factor In Practice

Elgun Jabrayilzade, Mikhail Evtikhiev, Eray Tüzün et al.

Bus factor is a metric that identifies how resilient a project is to sudden engineer turnover. It states the minimal number of engineers that have to be hit by a bus for a project to be stalled. Even though the metric is often discussed in the community, few studies consider its general relevance. Moreover, the existing tools for bus factor estimation focus solely on the data from version control systems, even though there exist other channels for knowledge generation and distribution. With a survey of 269 engineers, we find that the bus factor is perceived as an important problem in collective development, and determine the highest-impact channels of knowledge generation and distribution in software development teams. We also propose a multimodal bus factor estimation algorithm that uses data on code reviews and meetings together with the VCS data. We test the algorithm on 13 projects developed at JetBrains and compare its results to those of the state-of-the-art tool by Avelino et al. against the ground truth collected in a survey of the engineers working on these projects. Our algorithm is slightly better than the tool by Avelino et al. at predicting both the bus factor and the key developers. Finally, we use the interviews and the surveys to derive a set of best practices to address the bus factor issue and proposals for a possible bus factor assessment tool.
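The definition above (the minimal number of engineers whose loss stalls a project) can be illustrated with a simplified greedy heuristic over VCS-only data. This sketch is not the multimodal algorithm proposed in the paper: the input format, the rule that a project is "stalled" once more than half of its files have no knowledgeable author left, and the function name are all assumptions made for illustration.

```python
from collections import Counter


def bus_factor(file_authors, threshold=0.5):
    """Greedily remove the most widely knowledgeable engineer until more than
    `threshold` of the files have nobody left who knows them.

    file_authors maps a file path to the set of engineers who know that file.
    Returns (bus factor, list of removed engineers in removal order).
    """
    remaining = {path: set(authors) for path, authors in file_authors.items()}
    total = len(remaining)
    removed = []
    if total == 0:
        return 0, removed
    while True:
        orphaned = sum(1 for authors in remaining.values() if not authors)
        if orphaned / total > threshold:
            return len(removed), removed
        coverage = Counter(a for authors in remaining.values() for a in authors)
        if not coverage:
            # Nobody left to remove (only reachable when threshold >= 1).
            return len(removed), removed
        # Ties are broken alphabetically to keep the result deterministic.
        top = max(sorted(coverage), key=lambda a: coverage[a])
        removed.append(top)
        for authors in remaining.values():
            authors.discard(top)


if __name__ == "__main__":
    files = {
        "a.py": {"alice"},
        "b.py": {"alice"},
        "c.py": {"alice", "bob"},
        "d.py": {"bob"},
    }
    print(bus_factor(files))
```

A multimodal variant in the spirit of the paper would extend the per-file knowledge sets with signals from code reviews and meetings rather than relying on commits alone.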

SE | Oct 1, 2021
The IntelliJ Platform: a Framework for Building Plugins and Mining Software Data

Zarina Kurbatova, Yaroslav Golubev, Vladimir Kovalenko et al.

In software engineering, a great number of new approaches are being actively researched, and many tools are being developed based on them. These tools require a framework for their creation and an opportunity to be used by potential developers. Modern IDEs provide both. In this paper, we describe the main capabilities of the IntelliJ Platform that could be useful for researchers who are developing code analysis tools. To illustrate the benefits of using the platform, we describe several use cases that researchers might be interested in: mining software data, running machine learning models on code, recommending refactorings, and visualizing data in the IDE. We provide several examples of existing plugins that implement these cases. Finally, to make it easier to start working with the platform, we develop and provide simple plugins for each use case that could serve as a template for a new project.

SE | Mar 23, 2021
PSIMiner: A Tool for Mining Rich Abstract Syntax Trees from Code

Egor Spirin, Egor Bogomolov, Vladimir Kovalenko et al.

The application of machine learning algorithms to source code has grown in the past years. Since these algorithms are quite sensitive to input data, it is not surprising that researchers experiment with input representations. Nowadays, a popular starting point to represent code is abstract syntax trees (ASTs). Abstract syntax trees have been used for a long time in various software engineering domains, and in particular in IDEs. The API of modern IDEs allows one to manipulate and traverse ASTs, resolve references between code elements, etc. Such algorithms can enrich ASTs with new data and therefore may be useful in ML-based code analysis. In this work, we present PSIMiner, a tool for processing PSI trees from the IntelliJ Platform. PSI trees contain code syntax trees as well as functions to work with them, and therefore can be used to enrich code representation using static analysis algorithms of modern IDEs. To showcase this idea, we use our tool to infer types of identifiers in Java ASTs and extend the code2seq model for the method name prediction problem.

SE | Dec 9, 2020
TaskTracker-tool: a Toolkit for Tracking of Code Snapshots and Activity Data During Solution of Programming Tasks

Elena Lyulina, Anastasiia Birillo, Vladimir Kovalenko et al.

The process of writing code and the use of features in an integrated development environment (IDE) are a fruitful source of data in computing education research. Existing studies use records of students' actions in the IDE, consecutive code snapshots, compilation events, and others, to gain deep insight into the process of student programming. In this paper, we present a set of tools for collecting and processing data of student activity during problem-solving. The first tool is a plugin for IntelliJ-based IDEs (PyCharm, IntelliJ IDEA, CLion). By capturing snapshots of code and IDE interaction data, it enables analysis of the process of writing code in different languages: Python, Java, Kotlin, and C++. The second tool is designed for the post-processing of data collected by the plugin and is capable of basic analysis and visualization. To validate and showcase the toolkit, we present a dataset collected by our tools. It consists of records of activity and IDE interaction events during the solution of programming tasks by 148 participants of different ages and levels of programming experience. We propose several directions for further exploration of the dataset.

SE | May 3, 2020
Pandemic Programming: How COVID-19 affects software developers and how their organizations can help

Paul Ralph, Sebastian Baltes, Gianisa Adisaputri et al.

Context. As a novel coronavirus swept the world in early 2020, thousands of software developers began working from home. Many did so on short notice, under difficult and stressful conditions. Objective. This study investigates the effects of the pandemic on developers' wellbeing and productivity. Method. A questionnaire survey was created mainly from existing, validated scales and translated into 12 languages. The data was analyzed using non-parametric inferential statistics and structural equation modeling. Results. The questionnaire received 2225 usable responses from 53 countries. Factor analysis supported the validity of the scales and the structural model achieved a good fit (CFI = 0.961, RMSEA = 0.051, SRMR = 0.067). Confirmatory results include: (1) the pandemic has had a negative effect on developers' wellbeing and productivity; (2) productivity and wellbeing are closely related; (3) disaster preparedness, fear related to the pandemic and home office ergonomics all affect wellbeing or productivity. Exploratory analysis suggests that: (1) women, parents and people with disabilities may be disproportionately affected; (2) different people need different kinds of support. Conclusions. To improve employee productivity, software companies should focus on maximizing employee wellbeing and improving the ergonomics of employees' home offices. Women, parents and disabled persons may require extra support.

SE | Apr 3, 2020
Using Large-Scale Anomaly Detection on Code to Improve Kotlin Compiler

Timofey Bryksin, Victor Petukhov, Ilya Alexin et al.

In this work, we apply anomaly detection to source code and bytecode to facilitate the development of a programming language and its compiler. We define an anomaly as a code fragment that is different from typical code written in a particular programming language. Identifying such code fragments is beneficial to both language developers and end users, since anomalies may indicate potential issues with the compiler or with runtime performance. Moreover, anomalies could correspond to problems in language design. For this study, we choose Kotlin as the target programming language. We outline and discuss approaches to obtaining vector representations of source code and bytecode and to the detection of anomalies across vectorized code snippets. The paper presents a method that aims to detect two types of anomalies: syntax tree anomalies and so-called compiler-induced anomalies that arise only in the compiled bytecode. We describe several experiments that employ different combinations of vectorization and anomaly detection techniques and discuss types of detected anomalies and their usefulness for language developers. We demonstrate that the extracted anomalies and the underlying extraction technique provide additional value for language development.
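As a rough illustration of detecting anomalies across vectorized code snippets, the sketch below flags snippets whose vectors lie unusually far from the centroid of all vectors. This is a generic distance-based stand-in, not one of the techniques evaluated in the paper: the feature vectors, the z-score rule, the threshold value, and the function name are all assumptions.

```python
import math
from statistics import mean, pstdev


def find_anomalies(vectors, z_threshold=2.0):
    """Return names of snippets whose distance to the centroid has a z-score
    above z_threshold.

    vectors maps a snippet name to its feature vector (equal-length lists).
    """
    names = list(vectors)
    if not names:
        return []
    dim = len(vectors[names[0]])
    centroid = [mean(vectors[n][i] for n in names) for i in range(dim)]
    distance = {n: math.dist(vectors[n], centroid) for n in names}
    mu = mean(distance.values())
    sigma = pstdev(distance.values())
    if sigma == 0:
        # All snippets are equally far from the centroid: nothing stands out.
        return []
    return [n for n in names if (distance[n] - mu) / sigma > z_threshold]


if __name__ == "__main__":
    # Hypothetical 2-D features, e.g. (AST depth, number of nested lambdas).
    snippets = {f"typical_{i}": [1.0, 1.0] for i in range(9)}
    snippets["odd"] = [10.0, 10.0]
    print(find_anomalies(snippets))
```

In practice the choice of vectorization matters more than the detector: the same rule applied to syntax-tree features and to bytecode features would surface different kinds of fragments.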

SE | Feb 10, 2020
Building Implicit Vector Representations of Individual Coding Style

Vladimir Kovalenko, Egor Bogomolov, Timofey Bryksin et al.

With the goal of facilitating team collaboration, we propose a new approach to building vector representations of individual developers by capturing their individual contribution style, or coding style. Such representations can find use in the next generation of software development team collaboration tools, for example by enabling the tools to track knowledge transfer in teams. The key idea of our approach is to avoid using explicitly defined metrics of coding style and instead build the representations through training a model for authorship recognition and extracting the representations of individual developers from the trained model. By empirically evaluating the output of our approach, we find that implicitly built individual representations reflect some properties of team structure: developers who report learning from each other are represented closer to each other.

SE | Jan 30, 2020
Authorship Attribution of Source Code: A Language-Agnostic Approach and Applicability in Software Engineering

Egor Bogomolov, Vladimir Kovalenko, Yurii Rebryk et al.

Authorship attribution (i.e., determining who is the author of a piece of source code) is an established research topic. State-of-the-art results for the authorship attribution problem look promising for the software engineering field, where they could be applied to detect plagiarized code and prevent legal issues. With this article, we first introduce a new language-agnostic approach to authorship attribution of source code. Then, we discuss limitations of existing synthetic datasets for authorship attribution, and propose a data collection approach that delivers datasets that better reflect aspects important for potential practical use in software engineering. Finally, we demonstrate that high accuracy of authorship attribution models on existing datasets drastically drops when they are evaluated on more realistic data. We outline next steps for the design and evaluation of authorship attribution models that could bring the research efforts closer to practical use for software engineering.

DATA-AN | Nov 30, 2016
Classifiers for centrality determination in proton-nucleus and nucleus-nucleus collisions

Igor Altsybeev, Vladimir Kovalenko

Centrality, as a geometrical property of the collision, is crucial for the physical interpretation of nucleus-nucleus and proton-nucleus experimental data. However, it cannot be directly accessed in event-by-event data analysis. Common methods for centrality estimation in A-A and p-A collisions usually rely on a single detector (either on the signal in zero-degree calorimeters or on the multiplicity in some semi-central rapidity range). In the present work, we attempt to develop an approach for centrality determination that is based on machine-learning techniques and utilizes information from several detector subsystems simultaneously. Different event classifiers are suggested and evaluated for their selectivity power in terms of the number of participant nucleons and the impact parameter of the collision. Finer centrality resolution may help reduce the impact of so-called volume fluctuations on physical observables studied in heavy-ion experiments such as ALICE at the LHC and the fixed-target experiment NA61/SHINE at the SPS.