StackEval: Benchmarking LLMs in Coding Assistance
This work provides benchmarks that researchers and developers can use to evaluate LLMs in coding assistance. The contribution is incremental: it applies existing evaluation methods to new datasets.
The authors address the evaluation of language models on coding assistance tasks by creating two benchmarks derived from Stack Overflow content. The benchmarks offer insights into LLMs' capabilities and limitations, particularly on new content. The authors also assess LLMs as judges for coding tasks, examining potential biases such as favoring their own solutions.
We present two comprehensive benchmarks to evaluate the performance of language models in coding assistance tasks, covering code writing, debugging, code review, and conceptual understanding. Our main contributions are two curated datasets: StackEval, a large-scale benchmark derived from Stack Overflow questions, and StackUnseen, a dynamic benchmark featuring the most recent Stack Overflow content. These benchmarks offer novel insights into the capabilities and limitations of LLMs, particularly in handling new and emerging content. Additionally, we assess LLMs' proficiency as judges for coding tasks using a curated, human-annotated dataset, exploring their evaluation capabilities and potential biases, including whether they favor their own generated solutions. Our findings underscore the potential of these benchmarks to advance LLM development and application in coding assistance. To ensure reproducibility, we publicly share our datasets and evaluation code at https://github.com/ProsusAI/stack-eval .
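The judge assessment described above can be illustrated with a minimal sketch. It is not the authors' evaluation code: the record fields (`author`, `judge_accepts`, `human_accepts`), the function names, and the toy data are all hypothetical, and the "self-preference gap" is just one plausible way to quantify whether a judge is more lenient toward its own answers than toward answers from other models, relative to human labels.

```python
def judge_agreement(records):
    """Fraction of records where the judge's verdict matches the human label."""
    if not records:
        raise ValueError("records must not be empty")
    matches = sum(1 for r in records if r["judge_accepts"] == r["human_accepts"])
    return matches / len(records)


def leniency(records):
    """Judge acceptance rate minus human acceptance rate for a set of records."""
    if not records:
        raise ValueError("records must not be empty")
    judge_rate = sum(r["judge_accepts"] for r in records) / len(records)
    human_rate = sum(r["human_accepts"] for r in records) / len(records)
    return judge_rate - human_rate


def self_preference_gap(records, judge_name):
    """Leniency on the judge's own answers minus leniency on other models' answers.

    A positive value suggests the judge is more generous toward answers it
    generated itself than human annotators are (hypothetical metric).
    """
    own = [r for r in records if r["author"] == judge_name]
    other = [r for r in records if r["author"] != judge_name]
    return leniency(own) - leniency(other)


# Toy, made-up records: which model wrote the answer, whether the judge
# (here "model_a") accepted it, and whether a human annotator accepted it.
records = [
    {"author": "model_a", "judge_accepts": True, "human_accepts": True},
    {"author": "model_a", "judge_accepts": True, "human_accepts": False},
    {"author": "model_b", "judge_accepts": False, "human_accepts": False},
    {"author": "model_b", "judge_accepts": True, "human_accepts": True},
]

print(judge_agreement(records))                 # agreement with human labels
print(self_preference_gap(records, "model_a"))  # leniency gap toward own answers
```

Comparing leniency against human labels, rather than raw acceptance rates, keeps a model that genuinely writes better answers from being mistaken for a biased judge.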