SE · AI · Apr 20

SolidCoder: Bridging the Mental-Reality Gap in LLM Code Generation through Concrete Execution

arXiv:2604.19825 · 54.7
Predicted impact: top 44% in SE · last 90 days
Originality: Incremental advance
AI Analysis

This paper addresses unreliable LLM code generation, a problem that affects both developers and researchers, and offers incremental improvements over existing methods.

The paper tackles the Mental-Reality Gap in LLM code generation, where models hallucinate execution traces and validate buggy code, by proposing SolidCoder, which forces edge-case awareness and uses sandboxed execution instead of mental simulation. With GPT-4o, it achieves state-of-the-art pass@1 performance of 95.7% on HumanEval, 77.0% on CodeContests, and 26.7% on APPS.

State-of-the-art code generation frameworks rely on mental simulation, where LLMs internally trace execution to verify correctness. We expose a fundamental limitation: the Mental-Reality Gap -- where models hallucinate execution traces and confidently validate buggy code. This gap manifests along two orthogonal dimensions: the Specification Gap (overlooking edge cases during planning) and the Verification Gap (hallucinating correct behavior for flawed code). We propose SolidCoder with a simple principle: don't imagine -- execute. The S.O.L.I.D. architecture addresses both dimensions by forcing edge-case awareness before algorithm design and replacing imagined traces with sandboxed execution using property-based oracles. With GPT-4o, SolidCoder achieves state-of-the-art pass@1 performance: 95.7% on HumanEval (+0.6%p), 77.0% on CodeContests (+4.3%p), and 26.7% on APPS (+3.4%p). Ablation reveals that edge-case awareness provides the largest individual gain, while execution grounding catches categorically different errors that specification improvements cannot address. These gains generalize to RL post-trained models, validating that bridging both gap dimensions is essential for robust code synthesis. We release our code and framework to facilitate future research.
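The "don't imagine -- execute" principle can be illustrated with a minimal sketch. This is not the authors' released implementation: the names `run_candidate`, `sort_oracle`, `edge_case_inputs`, and `verify` are hypothetical, the two candidate programs stand in for LLM output, and a child interpreter with a timeout is only a rough stand-in for a real sandbox. The oracle checks properties of the output (sorted, same multiset as the input) rather than comparing against a reference solution.

```python
import json
import random
import subprocess
import sys
from collections import Counter

# Hypothetical candidate programs, standing in for LLM-generated code.
BUGGY = """
def solve(xs):
    # Bug: set() silently drops duplicates before sorting.
    return sorted(set(xs))
"""
CORRECT = """
def solve(xs):
    return sorted(xs)
"""

# Wrapper that reads JSON input from stdin and prints the JSON result.
HARNESS = """
import json, sys
{code}
print(json.dumps(solve(json.loads(sys.stdin.read()))))
"""


def run_candidate(code, xs, timeout=5.0):
    """Execute the candidate in a separate interpreter process (assumed sandbox)."""
    proc = subprocess.run(
        [sys.executable, "-c", HARNESS.format(code=code)],
        input=json.dumps(xs),
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    if proc.returncode != 0:
        raise RuntimeError(proc.stderr)
    return json.loads(proc.stdout)


def sort_oracle(xs, out):
    """Property-based oracle: output is a sorted permutation of the input."""
    if not isinstance(out, list):
        return False
    is_sorted = all(a <= b for a, b in zip(out, out[1:]))
    return is_sorted and Counter(out) == Counter(xs)


def edge_case_inputs(rng, n_random=10):
    """Hand-picked edge cases first, then a few random lists."""
    cases = [[], [0], [1, 1], [3, 2, 1], [-5, 0, -5]]
    cases += [
        [rng.randint(-9, 9) for _ in range(rng.randint(0, 8))]
        for _ in range(n_random)
    ]
    return cases


def verify(code, seed=0):
    """Return the first input on which the candidate fails, or None if all pass."""
    rng = random.Random(seed)
    for xs in edge_case_inputs(rng):
        try:
            out = run_candidate(code, xs)
        except (RuntimeError, ValueError, subprocess.TimeoutExpired):
            return xs  # crash, malformed output, or timeout counts as a failure
        if not sort_oracle(xs, out):
            return xs
    return None


if __name__ == "__main__":
    print(verify(CORRECT))  # → None
    print(verify(BUGGY))    # → [1, 1]
```

The buggy candidate passes the empty and single-element cases and fails only on the duplicate case `[1, 1]`. That is the kind of error an imagined trace can wave through and a real execution cannot. The counterexample returned by `verify` could then be fed back to the model as concrete evidence for a repair attempt.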

