SE LG Jan 19

RM -RF: Reward Model for Run-Free Unit Test Evaluation

Elena Bruches, Daniil Grebenkin, Mikhail Klementev, Vadim Alperovich, Roman Derunets, Dari Baturova, Georgy Mkrtchyan, Oleg Sedukhin, Ivan Bondarenko, Nikolay Bushkov, Stanislav Moiseev

arXiv:2601.13097v1 7.22 citations

Originality: Incremental advance

AI Analysis

RM-RF addresses the infrastructure cost and latency of run-based test evaluation, a problem for developers and researchers working with large-scale test generation and RL-based code optimization, though it is an incremental improvement over existing evaluation methods.

The paper tackles the problem of slow and costly run-based evaluation of automatically generated unit tests by introducing RM-RF, a lightweight reward model that predicts execution-derived signals from source and test code alone, achieving an average F1 score of 0.69 across three targets.

We present RM-RF, a lightweight reward model for run-free evaluation of automatically generated unit tests. Instead of repeatedly compiling and executing candidate tests, RM-RF predicts, from source and test code alone, three execution-derived signals: (1) whether the augmented test suite compiles and runs successfully, (2) whether the generated test cases increase code coverage, and (3) whether the generated test cases improve the mutation kill rate. To train and evaluate RM-RF we assemble a multilingual dataset (Java, Python, Go) of focal files, test files, and candidate test additions labeled by an execution-based pipeline, and we release an associated dataset and methodology for comparative evaluation. We tested multiple model families and tuning regimes (zero-shot, full fine-tuning, and PEFT via LoRA), achieving an average F1 of 0.69 across the three targets. Compared to conventional compile-and-run instruments, RM-RF provides substantially lower latency and infrastructure cost while delivering competitive predictive fidelity, enabling fast, scalable feedback for large-scale test generation and RL-based code optimization.
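As an illustration of the headline metric, the following is a minimal sketch of how an average F1 over the three binary targets could be computed. It assumes the average is an unweighted mean of per-target F1 scores, which the abstract does not state; the target names and the record layout are hypothetical and are not taken from the released dataset.

```python
# Hypothetical target names for the three execution-derived signals.
TARGETS = ("compiles_and_runs", "increases_coverage", "improves_mutation_kill_rate")


def f1_score(y_true, y_pred):
    """Binary F1 for the positive class; returns 0.0 when there are no positives at all."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def average_f1(labels, preds):
    """Per-target F1 plus their unweighted mean (assumed averaging scheme).

    `labels` and `preds` are parallel lists of dicts mapping each target
    name to 0 or 1: execution-based labels and model predictions.
    """
    per_target = {
        name: f1_score([row[name] for row in labels], [row[name] for row in preds])
        for name in TARGETS
    }
    return per_target, sum(per_target.values()) / len(TARGETS)
```

Calling `average_f1(labels, preds)` returns both the per-target scores and their mean, so a single weak target is not hidden behind the average.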

