LG CL SE Apr 20, 2025

LeetCodeDataset: A Temporal Dataset for Robust Evaluation and Efficient Training of Code LLMs

Yunhui Xia, Wei Shen, Yan Wang, Jason Klein Liu, Huifeng Sun, Siyue Wu, Jian Hu, Xiaolong Xu

arXiv:2504.14655v153 citations h-index: 5

Originality: Incremental advance

AI Analysis

This paper provides a contamination-free evaluation and efficient training framework for code-generation models, addressing a domain-specific need in AI research.

The authors tackled the lack of reasoning-focused coding benchmarks and self-contained training testbeds for LLMs by introducing LeetCodeDataset, a high-quality dataset with temporal splits and rich metadata. Results show reasoning models significantly outperform non-reasoning ones, and supervised fine-tuning with only 2.6K model-generated solutions achieves performance comparable to using 110K samples.

We introduce LeetCodeDataset, a high-quality benchmark for evaluating and training code-generation models, addressing two key challenges in LLM research: the lack of reasoning-focused coding benchmarks and self-contained training testbeds. By curating LeetCode Python problems with rich metadata, broad coverage, 100+ test cases per problem, and temporal splits (pre/post July 2024), our dataset enables contamination-free evaluation and efficient supervised fine-tuning (SFT). Experiments show reasoning models significantly outperform non-reasoning counterparts, while SFT with only 2.6K model-generated solutions achieves performance comparable to 110K-sample counterparts. The dataset and evaluation framework are available on Hugging Face and GitHub.
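The temporal split described in the abstract can be illustrated with a minimal sketch. It assumes each problem record carries a release date; the field names (`id`, `release_date`) and the exact cutoff day are hypothetical, since the abstract only states "pre/post July 2024" and does not describe the dataset's schema.

```python
from datetime import date

# Assumed cutoff: the abstract says only "pre/post July 2024"; the exact day is a guess.
CUTOFF = date(2024, 7, 1)

def temporal_split(problems, cutoff=CUTOFF):
    """Split problem records into (train, test) by release date.

    Problems released before the cutoff go to training; problems released
    on or after it are held out for evaluation.
    """
    train = [p for p in problems if p["release_date"] < cutoff]
    test = [p for p in problems if p["release_date"] >= cutoff]
    return train, test

# Toy records with hypothetical field names, not the dataset's actual schema.
problems = [
    {"id": "a", "release_date": date(2023, 11, 5)},
    {"id": "b", "release_date": date(2024, 6, 30)},
    {"id": "c", "release_date": date(2024, 8, 12)},
]

train, test = temporal_split(problems)
print([p["id"] for p in train], [p["id"] for p in test])
```

Splitting by date rather than at random is what supports the contamination-free claim: a model whose training data predates the cutoff is unlikely to have seen the held-out problems.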
