SE · Apr 14

Evaluating LLMs Code Reasoning Under Real-World Context

arXiv:2604.12881 · 10.1 · h-index: 3

Predicted impact: top 46% in SE · last 90 days
Originality: Synthesis-oriented

AI Analysis

For researchers evaluating LLMs on code reasoning, this benchmark narrows the gap between simplified benchmark snippets and code that carries real-world structure and dependencies.

Existing code reasoning benchmarks use simplistic, LLM-generated snippets that fail to reflect real-world project complexity. R2Eval introduces 135 problems drawn from ten Python projects and serializes compound and custom types, enabling more realistic LLM evaluation.

Code reasoning tasks are increasingly crucial to evaluating large language models (LLMs). Yet most existing benchmarks rely on simplistic, LLM-generated snippets or human-written solutions to code challenges and often restrict inputs and outputs to primitive types, failing to reflect the structure and dependencies of real-world projects. These simplifications limit their ability to measure practical generalizability. We present R2Eval, a benchmark of 135 code reasoning problems drawn from ten widely used Python projects. Unlike prior work, R2Eval serializes compound and custom types, preserving real-world data complexity and enabling a more realistic assessment of LLMs.
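
The abstract does not describe how R2Eval serializes values, so the following is only an illustrative sketch of the general idea: turning compound and custom-typed values into JSON so they can appear as inputs or expected outputs in a code reasoning problem. The `Point` class, the `serialize` helper, and the `__type__` tagging scheme are all hypothetical and are not taken from the benchmark.

```python
import json
from dataclasses import dataclass, fields, is_dataclass


@dataclass
class Point:
    """Hypothetical custom type standing in for a project-defined class."""
    x: int
    y: int


def serialize(value):
    """Recursively convert a value into JSON-friendly data.

    Custom (dataclass) instances and tuples are tagged with their type name
    so the structure is not flattened into anonymous primitives.
    This tagging scheme is an assumption made for illustration.
    """
    if is_dataclass(value) and not isinstance(value, type):
        return {
            "__type__": type(value).__name__,
            "fields": {f.name: serialize(getattr(value, f.name)) for f in fields(value)},
        }
    if isinstance(value, dict):
        return {str(key): serialize(item) for key, item in value.items()}
    if isinstance(value, tuple):
        return {"__type__": "tuple", "items": [serialize(item) for item in value]}
    if isinstance(value, list):
        return [serialize(item) for item in value]
    return value  # primitives (int, float, str, bool, None) pass through


# A compound input mixing a dict, a list, a tuple, and a custom type.
example_input = {"path": [Point(0, 0), Point(1, 2)], "bounds": (0, 10)}
print(json.dumps(serialize(example_input), indent=2))
```

A problem built this way could show the model the serialized input alongside the function under test and ask it to predict the serialized output, rather than restricting both to primitive values.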
