SE CL · Mar 9, 2025

Is Your Benchmark (Still) Useful? Dynamic Benchmarking for Code Language Models

Batu Guan, Xiao Wu, Yuanyuan Yuan, Shaohua Li

arXiv:2503.06643v1 · 8 citations · h-index: 5

Originality: Incremental advance

AI Analysis

This work addresses the reliable evaluation of code language models, a concern for researchers and practitioners alike, by providing a method to mitigate data contamination. Its contribution is an incremental improvement in benchmark robustness.

The paper tackles the problem of code benchmarks becoming less useful once their contents have leaked into training data. It introduces a dynamic benchmarking framework that transforms each input program with semantic-preserving mutations. All ten evaluated models perform significantly worse on the dynamic benchmarks, the rankings of some models shift, and the benchmarks resist data contamination.

In this paper, we tackle a critical challenge in model evaluation: how to keep code benchmarks useful when models might have already seen them during training. We introduce a novel solution, a dynamic benchmarking framework, to address this challenge. Given a code understanding or reasoning benchmark, our framework dynamically transforms each input, i.e., a program, with various semantic-preserving mutations to build a benchmark that is syntactically new but semantically identical. We evaluated ten popular language models on our dynamic benchmarks. Our evaluation reveals several interesting or surprising findings: (1) all models perform significantly worse than before, (2) the ranking between some models shifts dramatically, and (3) our dynamic benchmarks can resist the data contamination problem.
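To make the idea of a semantic-preserving mutation concrete, here is a minimal sketch in Python. It is not the authors' framework: the abstract does not list the mutations used, so consistent variable renaming is assumed here as one plausible example. The names RenameVariables, collect_local_names, mutate, and run are hypothetical helpers introduced only for illustration, and the sketch ignores cases such as keyword arguments at call sites and global or nonlocal declarations.

```python
import ast


class RenameVariables(ast.NodeTransformer):
    """Consistently rename the given identifiers to neutral names (v0, v1, ...)."""

    def __init__(self, names):
        # Sort for a deterministic mapping across runs.
        self.mapping = {name: f"v{i}" for i, name in enumerate(sorted(names))}

    def visit_Name(self, node):
        # Covers both reads and writes of a variable.
        if node.id in self.mapping:
            node.id = self.mapping[node.id]
        return node

    def visit_arg(self, node):
        # Covers function parameters.
        if node.arg in self.mapping:
            node.arg = self.mapping[node.arg]
        return node


def collect_local_names(tree):
    """Collect parameter names and names that are assigned somewhere in the tree."""
    names = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.arg):
            names.add(node.arg)
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            names.add(node.id)
    return names


def mutate(source):
    """Return a syntactically different program with the same behavior (assumed mutation)."""
    tree = ast.parse(source)
    renamed = RenameVariables(collect_local_names(tree)).visit(tree)
    return ast.unparse(renamed)  # requires Python 3.9+


def run(source, func_name, *args):
    """Execute source in a fresh namespace and call the named function."""
    namespace = {}
    exec(source, namespace)
    return namespace[func_name](*args)


ORIGINAL = '''
def total(values, scale):
    result = 0
    for item in values:
        result += item * scale
    return result
'''

if __name__ == "__main__":
    mutated = mutate(ORIGINAL)
    print(mutated)
    # Both versions should agree on every input; this checks one.
    print(run(ORIGINAL, "total", [1, 2, 3], 2) == run(mutated, "total", [1, 2, 3], 2))
```

The mutated program keeps the function name and control flow but shares no variable names with the original, so a model that memorized the original text gets no direct match, while the expected answer to a reasoning question (for example, the output on a given input) is unchanged.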
