SE AI CL Mar 6, 2025

Dynamic Benchmarking of Reasoning Capabilities in Code Large Language Models Under Data Contamination

Simin Chen, Pranav Pusarla, Baishakhi Ray

arXiv:2503.04149v223.229 citations h-index: 4 ICML

Originality Incremental advance

AI Analysis

This work addresses the need for transparent, dynamic benchmarking of code LLMs to mitigate data contamination, though it is incremental in that it builds on existing contamination concerns.

The authors address the susceptibility of static benchmarks for code large language models to data contamination by proposing a benchmarking suite that generates semantically equivalent variations of programming problems. Results show that it effectively benchmarks reasoning capabilities under contamination risks while producing diverse problem sets for consistent evaluations across 21 Code LLMs.

The rapid evolution of code large language models underscores the need for effective and transparent benchmarking of their reasoning capabilities. However, the current benchmarking approach depends heavily on publicly available, human-created datasets. The widespread use of these fixed benchmark datasets makes the benchmarking process static and thus particularly susceptible to data contamination, an unavoidable consequence of the extensive data collection processes used to train Code LLMs. Existing approaches that address data contamination often suffer from human effort limitations and imbalanced problem complexity. To tackle these challenges, we propose \tool, a novel benchmarking suite for evaluating Code LLMs under potential data contamination. Given a seed programming problem, \tool employs multiple agents to extract and modify the context without altering the core logic, generating semantically equivalent variations. We introduce a dynamic data generation method and conduct empirical studies on two seed datasets across 21 Code LLMs. Results show that \tool effectively benchmarks reasoning capabilities under contamination risks while generating diverse problem sets to ensure consistent and reliable evaluations.
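The idea of rewriting a seed problem's context while leaving its core logic untouched can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract describes LLM agents doing the rewriting, whereas the sketch below swaps in plain template substitution so that it runs without a model. The names Problem, SCENARIOS, SEED_TEMPLATE, SEED_TESTS, generate_variants, and passes are all hypothetical and chosen only for illustration.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Problem:
    prompt: str   # natural-language description shown to the model
    tests: tuple  # ((args, expected), ...) pairs encoding the core logic


# Hypothetical scenario vocabularies; the paper uses LLM agents for this step.
SCENARIOS = [
    {"container": "ledger", "item": "invoice", "value": "amount"},
    {"container": "log", "item": "sensor", "value": "temperature"},
    {"container": "scoreboard", "item": "player", "value": "score"},
]

# Seed problem: only the surface context is templated, never the task itself.
SEED_TEMPLATE = (
    "Given a {container} of {item} {value}s as a list of integers, "
    "return the largest {value}."
)
SEED_TESTS = ((([3, 1, 2],), 3), (([-5, -2],), -2))


def generate_variants(n, seed):
    """Return up to n variants with distinct contexts and identical tests."""
    rng = random.Random(seed)
    chosen = rng.sample(SCENARIOS, k=min(n, len(SCENARIOS)))
    return [Problem(SEED_TEMPLATE.format(**s), SEED_TESTS) for s in chosen]


def passes(solution, problem):
    """Check a candidate solution against the problem's unchanged tests."""
    return all(solution(*args) == expected for args, expected in problem.tests)
```

Because every variant shares the seed's test cases, a solution that captures the underlying logic (here, Python's built-in max) passes all of them, while a model that only memorized the seed's wording has no surface text to match against.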
