SE | AI | Sep 16, 2025

SCoGen: Scenario-Centric Graph-Based Synthesis of Real-World Code Problems

arXiv:2509.14281v1 | h-index: 8 | Has Code
Originality: Incremental advance
AI Analysis

The paper addresses a bottleneck in training code models that matters to developers and researchers, though it is incremental because it builds on existing data extraction methods.

The paper tackles the scarcity of real-world coding problems for advancing code large language models by proposing a framework that synthesizes code problems from datasets like Stack Overflow and Kaggle, achieving superior performance over state-of-the-art models across benchmarks.

Significant advancements have been made in the capabilities of code large language models, leading to their rapid adoption and application across a wide range of domains. However, their further advancement is often constrained by the scarcity of real-world coding problems. To bridge this gap, we propose a novel framework for synthesizing code problems that emulate authentic real-world scenarios. This framework systematically integrates domain knowledge, domain skills, and coding skills, all of which are meticulously extracted from real-world programming-related datasets, including Stack Overflow and Kaggle. The extracted elements serve as the foundational building blocks for constructing code problems. To align the generated problems with practical applications, application scenarios are also mined from the same datasets. These scenarios are then used to construct a scenario-centric graph that interconnects domain knowledge, domain skills, and coding skills. Based on this structured representation, a sampling strategy on the graph is designed that controls the complexity and diversity of the generated code problems so that they reflect real-world challenges. Experimental results demonstrate that the proposed method consistently achieves superior performance over state-of-the-art open-source large language models of varying sizes and functionalities, including both coders and general-purpose models, across a diverse set of real-world benchmarks.
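
The scenario-centric graph and the sampling step described in the abstract can be pictured with a small sketch. This is not the authors' implementation: the abstract does not specify the graph schema or the sampling strategy, so the record layout, the names `RECORDS`, `KINDS`, `build_graph`, and `sample_problem_spec`, and the uniform random sampling are all assumptions made for illustration. The toy records are invented placeholders, not data from Stack Overflow or Kaggle.

```python
import random
from collections import defaultdict

# Hypothetical toy records. In the paper these elements are mined from
# Stack Overflow and Kaggle; the values here are invented placeholders.
RECORDS = [
    {
        "scenario": "customer churn analysis",
        "domain_knowledge": ["subscription billing"],
        "domain_skills": ["feature engineering"],
        "coding_skills": ["pandas groupby", "train/test split"],
    },
    {
        "scenario": "sensor anomaly detection",
        "domain_knowledge": ["time series seasonality"],
        "domain_skills": ["outlier scoring"],
        "coding_skills": ["rolling windows", "numpy vectorization"],
    },
]

# The three element types the abstract says are linked through scenarios.
KINDS = ("domain_knowledge", "domain_skills", "coding_skills")


def build_graph(records):
    """Map each scenario node to the element nodes it connects to, by kind."""
    graph = defaultdict(lambda: {kind: set() for kind in KINDS})
    for rec in records:
        for kind in KINDS:
            graph[rec["scenario"]][kind].update(rec[kind])
    return graph


def sample_problem_spec(graph, n_per_kind=1, seed=None):
    """Pick one scenario, then up to n_per_kind neighbours of each kind.

    Assumption: n_per_kind stands in for a complexity knob and the random
    scenario choice for diversity; the paper's actual strategy may differ.
    """
    rng = random.Random(seed)
    scenario = rng.choice(sorted(graph))
    spec = {"scenario": scenario}
    for kind in KINDS:
        pool = sorted(graph[scenario][kind])
        spec[kind] = rng.sample(pool, min(n_per_kind, len(pool)))
    return spec


if __name__ == "__main__":
    graph = build_graph(RECORDS)
    print(sample_problem_spec(graph, n_per_kind=2, seed=0))
```

A sampled specification like this would then be handed to a language model as the outline for a concrete problem statement; raising `n_per_kind` combines more skills per problem, which is one plausible way to vary difficulty.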

