SE CL LG Apr 7, 2025

Evaluating the Generalization Capabilities of Large Language Models on Code Reasoning

Rem Yang, Julian Dai, Nikos Vasilakis, Martin Rinard

arXiv:2504.05518v114.96 citations h-index: 3

Originality: Incremental advance

AI Analysis

This work examines how well the code reasoning abilities of LLMs generalize, a question relevant to AI researchers and developers. It offers insights into model performance across different program classes, though the advance is incremental.

The study assessed how large language models generalize their code reasoning abilities across various program types, finding that earlier models behave in ways consistent with pattern matching, while the latest models show strong generalization.

We assess how the code reasoning abilities of large language models (LLMs) generalize to different kinds of programs. We present techniques for obtaining in- and out-of-distribution programs with different characteristics: code sampled from a domain-specific language, code automatically generated by an LLM, code collected from competitive programming contests, and mutated versions of these programs. We also present an experimental methodology for evaluating LLM generalization by comparing their performance on these programs. We perform an extensive evaluation across 10 state-of-the-art models from the past year, obtaining insights into their generalization capabilities over time and across different classes of programs. Our results highlight that while earlier models exhibit behavior consistent with pattern matching, the latest models exhibit strong generalization abilities on code reasoning.
