SE, LG | Feb 13, 2024

Unsupervised Evaluation of Code LLMs with Round-Trip Correctness

Cambridge, Microsoft
arXiv:2402.08699v2 | 27 citations | h-index: 29 | ICML
AI Analysis

This work addresses the need of researchers and practitioners for scalable, cost-effective evaluation of code LLMs, though it is incremental in that it builds on existing evaluation concepts.

The paper tackles the problem of evaluating code large language models (LLMs) by introducing round-trip correctness (RTC) as an unsupervised method, which strongly correlates with performance on existing benchmarks and enables evaluation across a broader spectrum of real-world software domains without human curation.

To evaluate code large language models (LLMs), research has relied on a few small manually curated benchmarks, such as HumanEval and MBPP, which represent a narrow part of the real-world software domains. In this work, we introduce round-trip correctness (RTC) as an alternative evaluation method. RTC allows Code LLM evaluation on a broader spectrum of real-world software domains without the need for costly human curation. RTC rests on the idea that we can ask a model to make a prediction (e.g., describe some code using natural language), feed that prediction back (e.g., synthesize code from the predicted description), and check if this round-trip leads to code that is semantically equivalent to the original input. We show how to employ RTC to evaluate code synthesis and editing. We find that RTC strongly correlates with model performance on existing narrow-domain code synthesis benchmarks while allowing us to expand to a much broader set of domains and tasks which was not previously possible without costly human annotations.
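The round trip described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `forward` and `backward` are hypothetical callables standing in for model calls, the stub models are placeholders, and semantic equivalence is approximated here by running the regenerated code against a handful of input/output cases. The sketch assumes Python 3.9 or later.

```python
from typing import Callable, Sequence

# Hypothetical interfaces (assumptions, not from the paper):
# a forward model maps code to a description, a backward model maps
# a description back to code.
Forward = Callable[[str], str]
Backward = Callable[[str], str]
Case = tuple[tuple, object]  # (positional arguments, expected result)


def passes_tests(code: str, func_name: str, cases: Sequence[Case]) -> bool:
    """Approximate semantic equivalence by checking input/output cases."""
    namespace: dict = {}
    try:
        # exec is only acceptable for trusted code; sandbox it in practice.
        exec(code, namespace)
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in cases)
    except Exception:
        return False


def round_trip_correctness(
    original: str,
    func_name: str,
    cases: Sequence[Case],
    forward: Forward,
    backward: Backward,
    n_forward: int = 2,
    n_backward: int = 2,
) -> float:
    """Fraction of round trips whose regenerated code passes the cases."""
    successes = 0
    total = 0
    for _ in range(n_forward):
        description = forward(original)  # code -> natural language
        for _ in range(n_backward):
            candidate = backward(description)  # natural language -> code
            if passes_tests(candidate, func_name, cases):
                successes += 1
            total += 1
    return successes / total


# Stub models used purely for illustration.
ORIGINAL = "def add(a, b):\n    return a + b\n"
CASES = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]


def stub_forward(code: str) -> str:
    return "Return the sum of two numbers."


def good_backward(description: str) -> str:
    return "def add(x, y):\n    return x + y\n"


def bad_backward(description: str) -> str:
    return "def add(x, y):\n    return x - y\n"


if __name__ == "__main__":
    print(round_trip_correctness(ORIGINAL, "add", CASES, stub_forward, good_backward))
    print(round_trip_correctness(ORIGINAL, "add", CASES, stub_forward, bad_backward))
```

With real models, `forward` and `backward` would sample from the LLM under evaluation, and the score would be averaged over many code snippets drawn from the target domain.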

