CL · CV · CY · Mar 5, 2024

Design2Code: Benchmarking Multimodal Code Generation for Automated Front-End Engineering

Chenglei Si, Yanzhe Zhang, Ryan Li, Zhengyuan Yang, Ruibo Liu, Diyi Yang

Georgia Tech

arXiv:2403.03163v3 · 26.9 · 107 citations · h-index: 15 · NAACL

Originality: Synthesis-oriented

AI Analysis

This work provides a benchmark for automated front-end engineering. It addresses a specific domain need but is incremental, since it focuses on evaluation rather than new methods.

The authors evaluate how well multimodal large language models convert webpage screenshots into code. They introduce Design2Code, a benchmark of 484 real-world webpages with automatic metrics, and find that models mostly struggle to recall visual elements and to generate correct layouts.

Generative AI has made rapid advancements in recent years, achieving unprecedented capabilities in multimodal understanding and code generation. This can enable a new paradigm of front-end development in which multimodal large language models (MLLMs) directly convert visual designs into code implementations. In this work, we construct Design2Code - the first real-world benchmark for this task. Specifically, we manually curate 484 diverse real-world webpages as test cases and develop a set of automatic evaluation metrics to assess how well current multimodal LLMs can generate the code implementations that directly render into the given reference webpages, given the screenshots as input. We also complement automatic metrics with comprehensive human evaluations to validate the performance ranking. To rigorously benchmark MLLMs, we test various multimodal prompting methods on frontier models such as GPT-4o, GPT-4V, Gemini, and Claude. Our fine-grained break-down metrics indicate that models mostly lag in recalling visual elements from the input webpages and generating correct layout designs.
