CL LG May 8, 2025

Do MLLMs Capture How Interfaces Guide User Behavior? A Benchmark for Multimodal UI/UX Design Understanding

Jaehyun Jeon, Min Soo Kim, Jang Han Yoon, Sumin Shim, Yejin Choi, Hanbin Kim, Youngjae Yu

arXiv:2505.05026v38.32 citations h-index: 5

Originality: Incremental advance

AI Analysis

This work addresses the need for better evaluation of MLLMs in UI/UX design understanding, which could benefit designers and businesses by enabling behavior-aware interface optimization, though it is incremental as it builds on existing UI quality evaluation research.

The paper evaluates whether Multimodal Large Language Models (MLLMs) understand how user interface (UI) design influences user behavior, beyond surface-level features. It introduces WiserUI-Bench, a benchmark of 300 real-world UI image pairs from A/B tests and 684 expert rationales, and finds that current models show limited nuanced reasoning when predicting the more effective design and explaining why it works.

User interface (UI) design goes beyond visuals, guiding user behavior and overall user experience (UX). Strategically crafted interfaces, for example, can boost sign-ups and drive business sales, underscoring the shift toward UI/UX as a unified design concept. While recent studies have explored UI quality evaluation using Multimodal Large Language Models (MLLMs), they largely focus on surface-level features, overlooking behavior-oriented aspects. To fill this gap, we introduce WiserUI-Bench, a novel benchmark for assessing models' multimodal understanding of UI/UX design. It includes 300 diverse real-world UI image pairs, each consisting of two design variants A/B-tested at scale by actual companies, where one was empirically validated to steer more user actions than the other. Each pair is accompanied by one or more of 684 expert-curated rationales that capture key factors behind each winning design's effectiveness, spanning diverse cognitive dimensions of UX. Our benchmark supports two core tasks: (1) selecting the more effective UI/UX design by predicting the A/B test verified winner and (2) assessing how well a model, given the winner, can explain its effectiveness in alignment with expert reasoning. Experiments across several MLLMs show that current models exhibit limited nuanced reasoning about UI/UX design and its behavioral impact. We believe our work will foster research in UI/UX understanding and enable broader applications such as behavior-aware interface optimization.
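
To make the first task concrete, here is a minimal sketch of how winner selection could be scored as plain accuracy over the A/B-tested pairs. It is not the authors' evaluation code: the `UIPair` record, the `predict` callback, and the `selection_accuracy` function are hypothetical names introduced for illustration, and the paper's actual protocol may differ (for example, in how it handles the order in which the two variants are shown).

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class UIPair:
    """One benchmark item (hypothetical schema, not the dataset's real format)."""
    image_a: str  # path or identifier of variant A
    image_b: str  # path or identifier of variant B
    winner: str   # "A" or "B", the variant the A/B test favored


def selection_accuracy(
    pairs: Sequence[UIPair],
    predict: Callable[[str, str], str],
) -> float:
    """Fraction of pairs where predict(image_a, image_b) names the A/B-test winner."""
    if not pairs:
        raise ValueError("pairs must not be empty")
    correct = sum(1 for p in pairs if predict(p.image_a, p.image_b) == p.winner)
    return correct / len(pairs)
```

In practice `predict` would wrap a call to an MLLM and parse its answer into "A" or "B". A constant predictor such as `lambda a, b: "A"` gives a simple reference point for judging whether a model does better than always choosing the same side.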
