SEAI · Jul 30, 2024

WebApp1K: A Practical Code-Generation Benchmark for Web App Development

arXiv:2408.00019v1 · 10 citations · h-index: 2 · Has Code
Originality: Synthesis-oriented
AI Analysis

WebApp1K provides a practical tool for evaluating and improving LLMs in web development, but the contribution is incremental: it focuses on benchmarking rather than on novel methods.

The authors introduce WebApp1K, a benchmark that measures LLM ability in web app code generation. They find that open-source LLMs perform nearly as well as top models such as GPT-4o and Claude 3.5, and that model size correlates strongly with code correctness.

We introduce WebApp1K, a practical code-generation benchmark that measures the ability of LLMs to develop web apps. The benchmark aims to calibrate LLM output and help models progressively improve code correctness and functionality. It is lightweight and easy to run. We present the initial version of WebApp1K and share our findings from running the benchmark against the latest frontier LLMs. First, open-source LLMs deliver impressive performance, closely trailing GPT-4o and Claude 3.5. Second, model size correlates strongly with code correctness. Third, no prompting technique was found to lift performance either universally across all models or significantly for any single model.

Code Implementations: 1 repo
