Robert Heumüller

SE
h-index23
4papers
5citations
Novelty35%
AI Score34

4 Papers

SEDec 17, 2025
On Assessing the Relevance of Code Reviews Authored by Generative Models

Robert Heumüller, Frank Ortmeier

The use of large language models like ChatGPT in code review offers promising efficiency gains but also raises concerns about correctness and safety. Existing evaluation methods for code review generation either rely on automatic comparisons to a single ground truth, which fails to capture the variability of human perspectives, or on subjective assessments of "usefulness", a highly ambiguous concept. We propose a novel evaluation approach based on what we call multi-subjective ranking. Using a dataset of 280 self-contained code review requests and corresponding comments from CodeReview StackExchange, multiple human judges ranked the quality of ChatGPT-generated comments alongside the top human responses from the platform. Results show that ChatGPT's comments were ranked significantly better than human ones, even surpassing StackExchange's accepted answers. Going further, our proposed method motivates and enables more meaningful assessments of generative AI's performance in code review, while also raising awareness of potential risks of unchecked integration into review processes.

SEAug 25, 2025
Previously on... Automating Code Review

Robert Heumüller, Frank Ortmeier

Modern Code Review (MCR) is a standard practice in software engineering, yet it demands substantial time and resource investments. Recent research has increasingly explored automating core review tasks using machine learning (ML) and deep learning (DL). As a result, there is substantial variability in task definitions, datasets, and evaluation procedures. This study provides the first comprehensive analysis of MCR automation research, aiming to characterize the field's evolution, formalize learning tasks, highlight methodological challenges, and offer actionable recommendations to guide future research. Focusing on the primary code review tasks, we systematically surveyed 691 publications and identified 24 relevant studies published between May 2015 and April 2024. Each study was analyzed in terms of tasks, models, metrics, baselines, results, validity concerns, and artifact availability. In particular, our analysis reveals significant potential for standardization, including 48 task metric combinations, 22 of which were unique to their original paper, and limited dataset reuse. We highlight challenges and derive concrete recommendations for examples such as the temporal bias threat, which are rarely addressed so far. Our work contributes to a clearer overview of the field, supports the framing of new research, helps to avoid pitfalls, and promotes greater standardization in evaluation practices.

SEApr 16, 2021
Learning to Boost the Efficiency of Modern Code Review

Robert Heumüller

Modern Code Review (MCR) is a standard in all kinds of organizations that develop software. MCR pays for itself through perceived and proven benefits in quality assurance and knowledge transfer. However, the time invest in MCR is generally substantial. The goal of this thesis is to boost the efficiency of MCR by developing AI techniques that can partially replace or assist human reviewers. The envisioned techniques distinguish from existing MCR-related AI models in that we interpret these challenges as graph-learning problems. This should allow us to use state-of-science algorithms from that domain to learn coding and reviewing standards directly from existing projects. The required training data will be mined from online repositories and the experiments will be designed to use standard, quantitative evaluation metrics. This research proposal defines the motivation, research-questions, and solution components for the thesis, and gives an overview of the relevant related work.

SEAug 1, 2020
Guided Pattern Mining for API Misuse Detection by Change-Based Code Analysis

Sebastian Nielebock, Robert Heumüller, Kevin Michael Schott et al.

Lack of experience, inadequate documentation, and sub-optimal API design frequently cause developers to make mistakes when re-using third-party implementations. Such API misuses can result in unintended behavior, performance losses, or software crashes. Therefore, current research aims to automatically detect such misuses by comparing the way a developer used an API to previously inferred patterns of the correct API usage. While research has made significant progress, these techniques have not yet been adopted in practice. In part, this is due to the lack of a process capable of seamlessly integrating with software development processes. Particularly, existing approaches do not consider how to collect relevant source code samples from which to infer patterns. In fact, an inadequate collection can cause API usage pattern miners to infer irrelevant patterns which leads to false alarms instead of finding true API misuses. In this paper, we target this problem (a) by providing a method that increases the likelihood of finding relevant and true-positive patterns concerning a given set of code changes and agnostic to a concrete static, intra-procedural mining technique and (b) by introducing a concept for just-in-time API misuse detection which analyzes changes at the time of commit. Particularly, we introduce different, lightweight code search and filtering strategies and evaluate them on two real-world API misuse datasets to determine their usefulness in finding relevant intra-procedural API usage patterns. Our main results are (1) commit-based search with subsequent filtering effectively decreases the amount of code to be analyzed, (2) in particular method-level filtering is superior to file-level filtering, (3) project-internal and project-external code search find solutions for different types of misuses and thus are complementary, (4) incorporating prior knowledge of the misused [...]