CV AI CL LG · Mar 3, 2025

Advancing vision-language models in front-end development via data synthesis

Tong Ge, Yashu Liu, Jieping Ye, Tianyi Li, Chao Wang

arXiv:2503.01619v113.16 citations · h-index: 8

Originality: Incremental advance

AI Analysis

This work addresses a domain-specific problem for front-end developers by providing an incremental improvement in code generation from images.

The paper tackles the challenge of generating accurate, functional front-end code from design images with vision-language models, particularly for frameworks such as React and Vue. It proposes a reflective agentic workflow that synthesizes high-quality image-text data, and reports that the resulting model, Flame, achieves improved performance in generating React code as measured by the pass@k metric.

Modern front-end (FE) development, especially when leveraging the unique features of frameworks like React and Vue, presents distinctive challenges. These include managing modular architectures, ensuring synchronization between data and visual outputs for declarative rendering, and adapting reusable components to various scenarios. Such complexities make it particularly difficult for state-of-the-art large vision-language models (VLMs) to generate accurate and functional code directly from design images. To address these challenges, we propose a reflective agentic workflow that synthesizes high-quality image-text data to capture the diverse characteristics of FE development. This workflow automates the extraction of self-contained code snippets from real-world projects, renders the corresponding visual outputs, and generates detailed descriptions that link design elements to functional code. (A self-contained code snippet is one that encapsulates all necessary logic, styling, and dependencies, ensuring it functions independently without requiring external imports or context.) To further expand the scope and utility of the synthesis, we introduce three data synthesis strategies: Evolution-based synthesis, which enables scalable and diverse dataset expansion; Waterfall-Model-based synthesis, which generates logically coherent code derived from system requirements; and Additive Development synthesis, which iteratively increases the complexity of human-authored components. We build a large vision-language model, Flame, trained on the synthesized datasets and demonstrate its effectiveness in generating React code via the pass@k metric. Our results suggest that a code VLM trained to interpret images before code generation may achieve better performance.
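
The abstract names the pass@k metric without saying how it is computed. As a minimal sketch, assuming the commonly used unbiased estimator from the HumanEval code-generation benchmark (which may differ from the authors' exact evaluation setup): for one task with n generated samples, of which c pass, pass@k = 1 - C(n-c, k) / C(n, k). The function names below are illustrative, and what counts as a "passing" React sample (for example, rendering successfully or matching the design) is left to the caller.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one task: n samples generated, c of them pass.

    Assumption: this is the standard unbiased estimator,
    1 - C(n - c, k) / C(n, k), not necessarily the paper's exact procedure.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples fail,
    # so any draw of k samples must contain a passing one.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over tasks; each entry is (n_samples, n_passing)."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For example, a task with 10 samples of which 3 pass gives pass@1 of about 0.3, the plain fraction of passing samples; larger k rewards a model that gets at least one of several attempts right.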
