How to benchmark: the Measure-Explain-Test-Improve loop
This document presents a methodology, the Measure-Explain-Test-Improve loop, for conducting performance benchmarks in computer science research, and in programming language research in particular, with the aim of improving the quality of performance evaluation.
It gives researchers with no prior benchmarking experience a structured approach to building solid performance evaluations, addressing a common weakness in programming language research.
I would like to share recommendations on how to do performance benchmarks for the purpose of evaluating computer science research. Research in my field (programming language research) often involves performance considerations, but performance measurement is typically not the main tool we use to evaluate our work: we usually evaluate via formal statements and their proofs, experience writing large or interesting examples, or systematic comparison of expressivity, feature set, etc. My impression is that, as a result, we tend not to do our performance evaluation very well. In the present document I will try to explain a methodology for benchmarking correctly (I hope!). People with no prior benchmarking experience should be able to follow it to build a solid performance evaluation as part of their research. I explain the justification for each aspect along the way.