WebGen-V Bench: Structured Representation for Enhancing Visual Design in LLM-based Web Generation and Evaluation
This work addresses the need for better benchmarks and evaluation in web generation for AI researchers and developers, though it is incremental in building on existing LLM capabilities.
The paper tackles the problem of low data quality and evaluation granularity in LLM-based web generation by introducing WebGen-V, a benchmark and framework that uses structured, section-wise data representation and multimodal evaluation, achieving validated effectiveness through experiments with state-of-the-art LLMs.
Witnessed by the recent advancements on leveraging LLM for coding and multimodal understanding, we present WebGen-V, a new benchmark and framework for instruction-to-HTML generation that enhances both data quality and evaluation granularity. WebGen-V contributes three key innovations: (1) an unbounded and extensible agentic crawling framework that continuously collects real-world webpages and can leveraged to augment existing benchmarks; (2) a structured, section-wise data representation that integrates metadata, localized UI screenshots, and JSON-formatted text and image assets, explicit alignment between content, layout, and visual components for detailed multimodal supervision; and (3) a section-level multimodal evaluation protocol aligning text, layout, and visuals for high-granularity assessment. Experiments with state-of-the-art LLMs and ablation studies validate the effectiveness of our structured data and section-wise evaluation, as well as the contribution of each component. To the best of our knowledge, WebGen-V is the first work to enable high-granularity agentic crawling and evaluation for instruction-to-HTML generation, providing a unified pipeline from real-world data acquisition and webpage generation to structured multimodal assessment.