SE · AI · CL · Sep 19, 2024

A Case Study of Web App Coding with OpenAI Reasoning Models

arXiv:2409.13773v1 · h-index: 2
Originality: Synthesis-oriented
AI Analysis

This study highlights performance variability in reasoning models for coding, which is an incremental insight for developers and researchers in AI-assisted programming.

The paper investigates the performance of OpenAI's reasoning models (o1-preview and o1-mini) on coding tasks, finding they achieve state-of-the-art results on the WebApp1K benchmark but decline significantly on a harder version (WebApp1K-Duo), falling behind Claude 3.5 and failing on atypical test cases.

This paper presents a case study of coding tasks performed by OpenAI's latest reasoning models, o1-preview and o1-mini, in comparison with other frontier models. The o1 models deliver SOTA results on WebApp1K, a single-task benchmark. Building on this, we introduce WebApp1K-Duo, a harder benchmark that doubles the number of tasks and test cases. On the new benchmark, the o1 models' performance declines significantly, falling behind Claude 3.5. Moreover, they consistently fail when confronted with atypical yet correct test cases, a trap that non-reasoning models occasionally avoid. We hypothesize that the performance variability is due to instruction comprehension. Specifically, the reasoning mechanism boosts performance when all expectations are captured but exacerbates errors when key expectations are missed, an effect potentially influenced by input length. As such, we argue that the coding success of reasoning models hinges on a top-notch base model and SFT to ensure meticulous adherence to instructions.

Code Implementations: 1 repo
