SE AI | Jul 25, 2023

Predicting Code Coverage without Execution

Michele Tufano, Shubham Chandel, Anisha Agarwal, Neel Sundaresan, Colin Clement

Microsoft

arXiv:2307.13383v1 | 10.713 citations | h-index: 38 | Has Code

Originality: Synthesis-oriented

AI Analysis

This work addresses the high computational cost of measuring code coverage, a concern for software engineers, but its contribution is incremental: it introduces a new benchmark rather than a breakthrough method.

The paper tackles the resource-intensive problem of computing code coverage by proposing a benchmark task, Code Coverage Prediction, in which a large language model (LLM) predicts coverage without executing the code. It evaluates four state-of-the-art models on a curated dataset, COVERAGEEVAL. The abstract states that performance is reported but gives no specific numbers.

Code coverage is a widely used metric for quantifying the extent to which program elements, such as statements or branches, are executed during testing. Calculating code coverage is resource-intensive, requiring code building and execution with additional overhead for the instrumentation. Furthermore, computing coverage of any snippet of code requires the whole program context. Using Machine Learning to amortize this expensive process could lower the cost of code coverage by requiring only the source code context, and the task of code coverage prediction can be a novel benchmark for judging the ability of models to understand code. We propose a novel benchmark task called Code Coverage Prediction for Large Language Models (LLMs). We formalize this task to evaluate the capability of LLMs in understanding code execution by determining which lines of a method are executed by a given test case and inputs. We curate and release a dataset we call COVERAGEEVAL by executing tests and code from the HumanEval dataset and collecting code coverage information. We report the performance of four state-of-the-art LLMs used for code-related tasks, including OpenAI's GPT-4 and GPT-3.5-Turbo, Google's BARD, and Anthropic's Claude, on the Code Coverage Prediction task. Finally, we argue that code coverage as a metric and pre-training data source are valuable for overall LLM performance on software engineering tasks.
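To make the task concrete, here is a minimal sketch of the ground truth a model would be asked to predict: which lines of a method run for a given input. It is not the authors' pipeline. The method `classify`, the helpers `executed_lines` and `annotate`, and the `>` / `!` markers (borrowed from the coverage.py annotate convention) are all assumptions made for illustration, and the tracing uses Python's standard `sys.settrace` hook.

```python
import sys

# Hypothetical method under test, kept as a string so line numbers are explicit.
SOURCE = '''def classify(n):
    if n < 0:
        return "negative"
    if n == 0:
        return "zero"
    return "positive"
'''

FILENAME = "<method_under_test>"  # made-up name used to filter trace events


def executed_lines(source, func_name, *args):
    """Run func_name(*args) and return the set of 1-based source lines executed."""
    namespace = {}
    exec(compile(source, FILENAME, "exec"), namespace)
    hit = set()

    def tracer(frame, event, arg):
        # Ignore frames that do not belong to the method under test.
        if frame.f_code.co_filename != FILENAME:
            return None
        # "call" reports the def line; "line" reports each executed body line.
        if event in ("call", "line"):
            hit.add(frame.f_lineno)
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        namespace[func_name](*args)
    finally:
        sys.settrace(previous)  # always restore whatever tracer was active
    return hit


def annotate(source, hit):
    """Prefix each line with '>' if it was executed and '!' if it was missed."""
    return "\n".join(
        ("> " if number in hit else "! ") + line
        for number, line in enumerate(source.splitlines(), start=1)
    )


if __name__ == "__main__":
    print(annotate(SOURCE, executed_lines(SOURCE, "classify", 0)))
```

For the input `0`, the two `return` statements on the untaken paths are marked `!` and every other line is marked `>`. In the benchmark setting, a model would be given the method and the test input and asked to produce this annotation without running anything.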
