CL · Feb 28, 2025

ProBench: Benchmarking Large Language Models in Competitive Programming

Lei Yang, Renren Jin, Ling Shi, Jianxiang Peng, Yue Chen, Deyi Xiong

arXiv:2502.20868v1 · 17 citations · h-index: 11

Originality: Incremental advance

AI Analysis

ProBench provides a more rigorous evaluation framework for researchers and developers working on reasoning-oriented LLMs for programming, though the advance is incremental because it builds on existing benchmarking approaches.

The authors address the inadequacy of existing benchmarks for assessing code reasoning in advanced large language models by creating ProBench, a competitive programming benchmark built from real contest problems. On it, QwQ-32B-Preview achieved the best score of 20.93, ahead of the other evaluated models, including DeepSeek-V3 at 16.38.

With the emergence of reasoning language models such as OpenAI-o3 and DeepSeek-R1, large language models (LLMs) have entered a new phase of development. However, existing benchmarks for coding evaluation are becoming inadequate for assessing the capability of advanced LLMs in code reasoning. To bridge this gap in high-level code reasoning assessment, we propose ProBench to benchmark LLMs in competitive programming, drawing inspiration from the International Collegiate Programming Contest. ProBench collects a comprehensive set of competitive programming problems from the Codeforces, Luogu, and Nowcoder platforms over the period from July to December 2024, and obtains real test results through online submissions to ensure the fairness and accuracy of the evaluation. We establish a unified problem attribute system, including difficulty grading and algorithm tagging. With carefully collected and annotated data in ProBench, we systematically assess nine recent LLMs in competitive programming across multiple dimensions, including thought chain analysis, error type diagnosis, and reasoning depth evaluation. Experimental results show that QwQ-32B-Preview achieves the best score of 20.93, followed by DeepSeek-V3 with a score of 16.38, suggesting that models trained on specialized reasoning tasks significantly outperform general-purpose models in programming, even when the general-purpose models are larger than the reasoning-oriented ones. Further analysis also reveals key areas for improving programming capability, e.g., algorithm adaptability and reasoning sufficiency, providing important insights for the future development of reasoning models.
