SE AI CL · Oct 5, 2025

MacroBench: A Novel Testbed for Web Automation Scripts via Large Language Models

arXiv:2510.04363v28.02 citations · h-index: 2 · Has Code

Originality: Incremental advance

AI Analysis

MacroBench provides a reproducible benchmark for assessing LLM-generated web automation scripts, and its results show that current models cannot yet handle complex workflows.

The researchers evaluate whether large language models can synthesize reusable browser-automation programs from natural-language goals. Overall success ranged from 83.4% (DeepSeek) to 96.8% (GPT-4o-mini); models succeeded on 91.7% of simple tasks but on none (0.0%) of the complex workflows.

We introduce MacroBench, a code-first benchmark that evaluates whether LLMs can synthesize reusable browser-automation programs (macros) from natural-language goals by reading HTML/DOM and emitting Selenium. MacroBench instantiates seven self-hosted sites covering 681 tasks across interaction complexity and targeting difficulty. Our end-to-end protocol validates generated code via static checks, sandboxed execution, and outcome verification (DOM assertions, database snapshots), and includes a safety suite for scraping, spam/abuse, and credential/privacy prompts. Across 2,636 model-task runs, we observe stratified success: GPT-4o-mini (96.8%), GPT-4o (95.3%), Gemini (89.0%), DeepSeek (83.4%). Models handle simple tasks reliably (91.7%) but fail on complex workflows (0.0%), and none meet production-quality coding practices despite functional completion. We release our complete benchmark pipeline, evaluation framework, and experimental results at https://github.com/hyunjun1121/MacroBench to enable reproducible assessment of macro synthesis for web automation.
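The abstract names static checks as the first validation stage but does not say what they cover. As a rough illustration only, one plausible static check parses a generated macro and rejects imports outside an allowlist before anything is executed. The function name `static_check` and the allowlist below are hypothetical and are not taken from the MacroBench repository.

```python
import ast

# Hypothetical allowlist: top-level modules a generated macro may import.
ALLOWED_IMPORT_ROOTS = {"selenium", "time"}


def static_check(source: str) -> list[str]:
    """Return the problems found in a generated macro; an empty list means it passed."""
    try:
        tree = ast.parse(source)  # parses only; the macro is never executed here
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg}"]

    problems = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            roots = [alias.name.split(".")[0] for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            # node.module is None for relative imports such as "from . import x"
            roots = [(node.module or "").split(".")[0]]
        else:
            continue
        for root in roots:
            if root not in ALLOWED_IMPORT_ROOTS:
                problems.append(f"disallowed import: {root}")
    return problems
```

A macro that passes a check like this would still need the sandboxed execution and outcome verification the abstract describes, since an import allowlist says nothing about whether the script reaches its goal.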
