AI · Jun 12, 2025

OIBench: Benchmarking Strong Reasoning Models with Olympiad in Informatics

Yaoming Zhu, Junxin Wang, Yiyang Li, Lin Qiu, ZongYu Wang, Jun Xu, Xuezhi Cao, Yuhuai Wei, Mingshi Wang, Xunliang Cai, Rong Ma

arXiv:2506.10481v112.45 citations · h-index: 12 · Has Code

Originality: Incremental advance

AI Analysis

OIBench provides a more challenging benchmark for evaluating algorithmic reasoning in large language models, addressing the saturation of conventional benchmarks.

The authors introduced OIBench, a challenging olympiad-level informatics dataset with 250 original problems, and found that current state-of-the-art models outperform most human participants in correctness and efficiency, though they remain suboptimal compared to canonical solutions.

As models become increasingly sophisticated, conventional algorithm benchmarks are increasingly saturated, underscoring the need for more challenging benchmarks to guide future improvements in algorithmic reasoning. This paper introduces OIBench, a high-quality, private, and challenging olympiad-level informatics dataset comprising 250 carefully curated original problems. We detail the construction methodology of the benchmark, ensuring a comprehensive assessment across various programming paradigms and complexities, and we demonstrate its contamination-resistant properties via experiments. We propose Time/Space Completion Curves for finer-grained efficiency analysis and enable direct human-model comparisons through high-level participant evaluations. Our experiments reveal that while open-source models lag behind closed-source counterparts, current SOTA models already outperform most human participants in both correctness and efficiency, while still being suboptimal compared to the canonical solutions. By releasing OIBench as a fully open-source resource (https://huggingface.co/datasets/AGI-Eval/OIBench), we hope this benchmark will contribute to advancing code reasoning capabilities for future LLMs.
