SE AI CL LG Jan 29, 2024

NoFunEval: Funny How Code LMs Falter on Requirements Beyond Functional Correctness

Manav Singhal, Tushar Aggarwal, Abhijeet Awasthi, Nagarajan Natarajan, Aditya Kanade

arXiv:2401.15963v318.425 citations h-index: 10

Originality: Incremental advance

AI Analysis

The paper addresses a gap in evaluating code LMs against real-world software engineering needs, but the advance is incremental: it extends existing benchmarks with non-functional requirements.

The authors address the problem that existing benchmarks for code language models (LMs) measure only functional correctness and ignore non-functional requirements such as efficiency and security. They propose a new benchmark, NoFunEval, and a prompting method, Coding Concepts (CoCo). Evaluating 27 code LMs, they find that the models generally falter on the benchmark, with low classification accuracy even on functional-correctness instances derived from HumanEval, hinting at fundamental blindspots in their training.

Existing evaluation benchmarks of language models of code (code LMs) focus almost exclusively on whether the LMs can generate functionally-correct code. In real-world software engineering, developers think beyond functional correctness. They have requirements on "how" a functionality should be implemented to meet overall system design objectives like efficiency, security, and maintainability. They would also trust the code LMs more if the LMs demonstrate robust understanding of such requirements. We propose a new benchmark NoFunEval to evaluate code LMs on non-functional requirements and simple classification instances for both functional and non-functional requirements. We propose a prompting method, Coding Concepts (CoCo), as a way for a developer to communicate the domain knowledge to the LMs. We conduct an extensive evaluation of 27 code LMs. Our finding is that LMs generally falter when tested on our benchmark, hinting at fundamental blindspots in their training setups. Surprisingly, even the classification accuracy on functional-correctness instances derived from the popular HumanEval benchmark is low, calling in question the depth of their comprehension and the source of their success in generating functionally-correct code in the first place. We release our benchmark and evaluation scripts publicly at https://aka.ms/NoFunEval.
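
To make the two ideas in the abstract concrete, here is a minimal Python sketch of (a) a prompt that carries developer-supplied concept hints, in the spirit of Coding Concepts (CoCo), and (b) accuracy over classification instances. It is an illustration only: the function names `build_prompt` and `classification_accuracy`, the prompt wording, and the "A"/"B" label format are assumptions made here, not the paper's actual templates or evaluation scripts (those are released at the URL above).

```python
from typing import Sequence


def build_prompt(requirement: str, code: str, concepts: Sequence[str] = ()) -> str:
    """Assemble a hypothetical edit prompt.

    `concepts` stands in for the domain hints a developer might pass to the
    model; the wording below is an assumption, not the paper's template.
    """
    parts = [f"Requirement: {requirement}"]
    if concepts:
        parts.append("Coding concepts to consider: " + ", ".join(concepts))
    parts.append("Code:\n" + code)
    return "\n\n".join(parts)


def classification_accuracy(predictions: Sequence[str], labels: Sequence[str]) -> float:
    """Fraction of instances where the predicted label equals the gold label."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("at least one instance is required")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Hypothetical usage: one prompt with hints, and accuracy over four instances.
prompt = build_prompt(
    requirement="Reduce the running time of this function.",
    code="def total(xs):\n    return sum([x for x in xs])",
    concepts=["avoid intermediate lists", "use generators"],
)
accuracy = classification_accuracy(["A", "B", "A", "A"], ["A", "B", "B", "A"])
```

Passing an empty `concepts` sequence yields a plain prompt without hints, so the same helper can serve as a baseline for comparing prompts with and without concept hints.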
