LG · CL · SE · Apr 20, 2025

LeetCodeDataset: A Temporal Dataset for Robust Evaluation and Efficient Training of Code LLMs

arXiv:2504.14655v1 · 53 citations · h-index: 5
Originality: Incremental advance
AI Analysis

LeetCodeDataset provides a contamination-free evaluation and efficient training framework for code-generation models, addressing a domain-specific need in AI research.

The authors tackled the lack of reasoning-focused coding benchmarks and self-contained training testbeds for LLMs by introducing LeetCodeDataset, a high-quality dataset with temporal splits and rich metadata. Results show reasoning models significantly outperform non-reasoning ones, and supervised fine-tuning with only 2.6K model-generated solutions achieves performance comparable to using 110K samples.

We introduce LeetCodeDataset, a high-quality benchmark for evaluating and training code-generation models, addressing two key challenges in LLM research: the lack of reasoning-focused coding benchmarks and self-contained training testbeds. By curating LeetCode Python problems with rich metadata, broad coverage, 100+ test cases per problem, and temporal splits (pre/post July 2024), our dataset enables contamination-free evaluation and efficient supervised fine-tuning (SFT). Experiments show reasoning models significantly outperform non-reasoning counterparts, while SFT with only 2.6K model-generated solutions achieves performance comparable to 110K-sample counterparts. The dataset and evaluation framework are available on Hugging Face and GitHub.
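
The temporal split described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' code: the field names (`slug`, `release_date`), the toy records, and the exact cutoff day are assumptions made for illustration, since the abstract only says "pre/post July 2024". The idea is that problems released before the cutoff go to training, and problems released on or after it are held out for evaluation, so a model trained earlier is unlikely to have seen them.

```python
from datetime import date

# Assumed cutoff: the abstract only says "pre/post July 2024";
# the exact boundary day used here is a guess.
CUTOFF = date(2024, 7, 1)


def temporal_split(problems, cutoff=CUTOFF):
    """Split problem records into (train, test) by release date.

    Problems released strictly before `cutoff` go to train;
    problems released on or after `cutoff` go to test.
    """
    train = [p for p in problems if p["release_date"] < cutoff]
    test = [p for p in problems if p["release_date"] >= cutoff]
    return train, test


# Toy records with hypothetical field names and made-up dates.
problems = [
    {"slug": "problem-a", "release_date": date(2023, 11, 5)},
    {"slug": "problem-b", "release_date": date(2024, 6, 30)},
    {"slug": "problem-c", "release_date": date(2024, 7, 1)},
    {"slug": "problem-d", "release_date": date(2024, 9, 14)},
]

train, test = temporal_split(problems)
print("train:", [p["slug"] for p in train])
print("test:", [p["slug"] for p in test])
```

Splitting on a date rather than at random is what makes the held-out set useful for contamination checks: any model whose training data ends before the cutoff cannot have memorized the test problems.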

Code Implementations: 1 repo