SE AI Jan 5

WebCoderBench: Benchmarking Web Application Generation with Comprehensive and Interpretable Evaluation Metrics

Chenxu Liu, Yingjie Fu, Wei Yang, Ying Zhang, Tao Xie

arXiv:2601.02430v1 15.59 citations

Originality: Incremental advance

AI Analysis

WebCoderBench gives LLM developers a tool for optimizing models for web app generation, though the advance is incremental because it builds on existing benchmarking efforts.

The authors tackle the challenge of benchmarking large language models for web application generation by introducing WebCoderBench, a benchmark with 1,572 real user requirements and 24 evaluation metrics. In experiments with 12 LLMs and 2 LLM-based agents, no model dominated across all metrics.

Web applications (web apps) have become a key arena for large language models (LLMs) to demonstrate their code generation capabilities and commercial potential. However, building a benchmark for LLM-generated web apps remains challenging due to the need for real-world user requirements, generalizable evaluation metrics that do not rely on ground-truth implementations or test cases, and interpretable evaluation results. To address these challenges, we introduce WebCoderBench, the first real-world-collected, generalizable, and interpretable benchmark for web app generation. WebCoderBench comprises 1,572 real user requirements, covering diverse modalities and expression styles that reflect realistic user intentions. WebCoderBench provides 24 fine-grained evaluation metrics across 9 perspectives, combining rule-based and LLM-as-a-judge paradigms for fully automated, objective, and general evaluation. Moreover, WebCoderBench adopts human-preference-aligned weights over metrics to yield interpretable overall scores. Experiments across 12 representative LLMs and 2 LLM-based agents show that no model dominates across all evaluation metrics, which gives LLM developers an opportunity to improve their models in a targeted manner.
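
The idea of combining per-metric scores with preference-aligned weights into one interpretable overall score can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the metric names, the scores, the weights, and the choice of a normalized weighted mean are hypothetical placeholders, not values or formulas taken from WebCoderBench, whose actual weighting scheme may differ.

```python
from typing import Dict


def overall_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Return the weighted mean of per-metric scores.

    Assumed aggregation form (not confirmed by the source): each metric score
    is multiplied by its weight, and the sum is divided by the total weight.
    """
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing scores for metrics: {sorted(missing)}")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(weights[m] * scores[m] for m in weights) / total_weight


def contributions(scores: Dict[str, float], weights: Dict[str, float]) -> Dict[str, float]:
    """Return each metric's share of the overall score.

    The shares add up to overall_score(...), which is what makes the
    aggregate easy to interpret metric by metric.
    """
    total_weight = sum(weights.values())
    return {m: weights[m] * scores[m] / total_weight for m in weights}


# Hypothetical metrics, scores in [0, 1], and weights (all made up).
example_scores = {"functionality": 0.8, "visual_quality": 0.6, "accessibility": 0.5}
example_weights = {"functionality": 0.5, "visual_quality": 0.3, "accessibility": 0.2}

example_overall = overall_score(example_scores, example_weights)
example_parts = contributions(example_scores, example_weights)
```

Because the per-metric contributions sum to the overall score, a low aggregate can be traced back to the specific metrics that pulled it down, which is the kind of targeted diagnosis the abstract describes.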
