Vibe Code Bench: Evaluating AI Models on End-to-End Web Application Development

Hung Tran, Langston Nashold, Rayan Krishnan, Antoine Bigeard, Alex Gu

arXiv:2603.04601

AI Analysis

This work addresses how to evaluate AI models on the complete process of building working web applications, a problem relevant to AI developers and researchers, and highlights the current limitations of frontier models in this domain.

This paper introduces Vibe Code Bench, a new benchmark for evaluating AI models on end-to-end web application development from scratch. The benchmark consists of 100 web application specifications and evaluates deployed applications using an autonomous browser agent. The best of 16 frontier models achieved only 58.0% accuracy on the test split, indicating that reliable end-to-end application development is still a significant challenge.

Code generation has emerged as one of AI's highest-impact use cases, yet existing benchmarks measure isolated tasks rather than the complete "zero-to-one" process of building a working application from scratch. We introduce Vibe Code Bench, a benchmark of 100 web application specifications (50 public validation, 50 held-out test) with 964 browser-based workflows comprising 10,131 substeps, evaluated against deployed applications by an autonomous browser agent. Across 16 frontier models, the best achieves only 58.0% accuracy on the test split, revealing that reliable end-to-end application development remains a frontier challenge. We identify self-testing during generation as a strong performance predictor (Pearson r=0.72), and show through a completed human alignment study that evaluator selection materially affects outcomes (31.8-93.6% pairwise step-level agreement). Our contributions include (1) a novel benchmark dataset and browser-based evaluation pipeline for end-to-end web application development, (2) a comprehensive evaluation of 16 frontier models with cost, latency, and error analysis, and (3) an evaluator alignment protocol with both cross-model and human annotation results.
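
The two statistics quoted in the abstract, pairwise step-level agreement between evaluators and a Pearson correlation, can be illustrated with a minimal sketch. It assumes that agreement means the fraction of substeps on which two evaluators give the same pass/fail verdict; the abstract does not state the paper's exact definition. The function names, evaluator names, and verdict lists are hypothetical toy values, not data from the benchmark.

```python
from itertools import combinations
from math import sqrt


def step_agreement(a, b):
    """Fraction of substeps on which two evaluators give the same verdict.

    Assumption: agreement is simple matching over aligned pass/fail lists.
    """
    if len(a) != len(b) or not a:
        raise ValueError("verdict lists must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def pairwise_agreement(verdicts):
    """Agreement for every unordered pair of evaluators, keyed by name pair."""
    return {
        (m, n): step_agreement(verdicts[m], verdicts[n])
        for m, n in combinations(sorted(verdicts), 2)
    }


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical toy verdicts (True = substep judged as passed).
toy_verdicts = {
    "evaluator_a": [True, True, False, True],
    "evaluator_b": [True, False, False, True],
    "human": [True, True, False, False],
}

if __name__ == "__main__":
    for pair, score in pairwise_agreement(toy_verdicts).items():
        print(pair, f"{score:.2f}")
```

In a setting like the one described, `pearson_r` would be applied to per-model values such as a self-testing rate and an accuracy score; which quantities the paper actually correlates is not specified beyond "self-testing during generation".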
