Toward a Benchmark Repository for Software Maintenance Tool Evaluations with Humans
This paper addresses the inconsistent and potentially biased evaluation of software maintenance tools by proposing a standardized benchmark repository for researchers.
The authors propose creating a benchmark repository of projects, tasks, and metadata to standardize evaluations of software maintenance tools involving human participants. This aims to address issues of ad-hoc project selection that can bias results and hinder comparisons.
To evaluate software maintenance techniques and tools in controlled experiments with human participants, researchers currently use projects and tasks selected on an ad-hoc basis. Such selection can unrealistically favor the tool under evaluation, and it makes results difficult to compare across studies. We suggest the gradual creation of a benchmark repository with projects, tasks, and metadata relevant for human-based studies. In this paper, we discuss the requirements and challenges of such a repository, along with the steps that could lead to its construction.