Mage: Multi-Axis Evaluation of LLM-Generated Executable Game Scenes Beyond Compile-Pass Rate

arXiv:2605.07342

AI Analysis

For researchers evaluating LLM code generation in complex domains, this work provides a multi-axis evaluation framework that reveals the inadequacy of compile-pass rate as a sole metric.

The paper demonstrates that compile-pass rate is a misleading metric for LLM-generated executable game scenes, proposing a four-axis evaluation protocol (Mage). Results show compile rate anti-correlates with functional correctness: direct NL-to-C# generation achieves 43% runtime-pass rate but near-zero mechanism adherence (F1≈0.12), while structural IR conditioning halves runtime rate but recovers domain-faithful structure (F1 up to 1.00).

Compile-pass rate is the dominant evaluation signal for LLM code generation, yet for multi-component domain-specific artifacts it can be actively misleading. We demonstrate this on executable game scene synthesis with a four-axis evaluation protocol (named "Mage") covering compile success, runtime success, structural fidelity, and mechanism adherence, applied to 858 generation attempts across four open-weight LLMs (7B to 30B), 26 hand-crafted Unity goal pattern playable concepts, and two automatically extracted IR granularity levels. Direct NL-to-C# generation achieves the highest runtime-pass rate (43% mean) yet produces structurally vacuous scenes (mechanism F1≈0.12). Structural IR conditioning halves the runtime rate but recovers domain-faithful structure (F1 up to 1.00). Within IR conditioning, behavior-only and full-scene granularity are statistically indistinguishable (McNemar p = 1.0), indicating input-level granularity saturation. These results show that compile rate is anti-correlated with functional correctness in this domain and that multi-axis evaluation is necessary to detect the divergence. We release the benchmark, replay logs, and per-record metrics for independent verification.
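To make the two statistics in the abstract concrete, the following is a minimal Python sketch of a set-based F1 score and a two-sided exact McNemar test. It is not the paper's implementation: the abstract does not define how mechanism F1 is computed, so treating mechanisms as sets of labels is an assumption, and the function names `mechanism_f1` and `mcnemar_exact` as well as the example labels are hypothetical.

```python
from math import comb


def mechanism_f1(expected, generated):
    """Set-based F1 between expected and generated mechanism labels.

    Assumption: mechanisms are compared as unordered sets of labels;
    the paper may define adherence differently.
    """
    expected, generated = set(expected), set(generated)
    true_positives = len(expected & generated)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(generated)
    recall = true_positives / len(expected)
    return 2 * precision * recall / (precision + recall)


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value.

    b: paired attempts where only condition A succeeded
    c: paired attempts where only condition B succeeded
    Concordant pairs do not enter the test.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    k = min(b, c)
    # Binomial(n, 0.5) probability of a split at least as lopsided as k.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical labels: a scene that realizes one of four expected mechanisms.
partial = mechanism_f1({"spawn", "collect", "score", "timer"}, {"spawn"})
# Perfectly balanced discordant pairs give p = 1.0.
balanced = mcnemar_exact(3, 3)
```

A p-value of 1.0 arises whenever the discordant pairs split evenly (or there are none), which is consistent with, though not proof of, the reported result that the two IR granularity levels succeed and fail on essentially the same cases.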
