HCAI · Jul 3, 2025

Synthetic Heuristic Evaluation: A Comparison between AI- and Human-Powered Usability Evaluation

UW
arXiv:2507.02306v1 · 5 citations · h-index: 32
Originality: Incremental advance
AI Analysis

This work addresses the problem of expensive usability testing for human-centered design practitioners, offering a potentially more efficient alternative, though it is incremental in applying existing LLM capabilities to a specific domain.

The researchers tackled the problem of costly usability evaluation by developing a synthetic heuristic evaluation method using multimodal LLMs to analyze images and provide design feedback. They found that their method identified 73% and 77% of usability issues across two apps, outperforming experienced human evaluators who identified 57% and 63%.

Usability evaluation is crucial in human-centered design but can be costly, requiring expert time and user compensation. In this work, we developed a method for synthetic heuristic evaluation using multimodal LLMs' ability to analyze images and provide design feedback. Comparing our synthetic evaluations to those by experienced UX practitioners across two apps, we found our evaluation identified 73% and 77% of usability issues, which exceeded the performance of 5 experienced human evaluators (57% and 63%). Compared to human evaluators, the synthetic evaluation maintained consistent performance across tasks and excelled in detecting layout issues, highlighting potential attentional and perceptual strengths of synthetic evaluation. However, synthetic evaluation struggled with recognizing some UI components and design conventions, as well as with identifying cross-screen violations. Additionally, testing synthetic evaluations across time and accounts revealed stable performance. Overall, our work highlights the performance differences between human and LLM-driven evaluations, informing the design of synthetic heuristic evaluations.
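
The percentages above can be read as detection rates: the share of known usability issues that an evaluator reported. The abstract does not describe the exact scoring procedure, so the following is only a minimal sketch under that assumption. The function name `detection_rate` and all issue IDs are hypothetical and are not data from the paper.

```python
def detection_rate(found: set[str], all_issues: set[str]) -> float:
    """Return the fraction of known issues that an evaluator reported.

    Assumption: performance is scored against a master list of issues;
    reported items that are not on the master list are ignored.
    """
    if not all_issues:
        raise ValueError("all_issues must not be empty")
    return len(found & all_issues) / len(all_issues)


# Hypothetical issue IDs for illustration only; not results from the paper.
all_issues = {"I1", "I2", "I3", "I4", "I5", "I6", "I7", "I8"}
synthetic_found = {"I1", "I2", "I3", "I5", "I6", "I8"}
human_found = {"I1", "I2", "I4", "I7", "I9"}  # "I9" is not on the master list

synthetic_rate = detection_rate(synthetic_found, all_issues)
human_rate = detection_rate(human_found, all_issues)
print(f"synthetic: {synthetic_rate:.0%}, human: {human_rate:.0%}")
```

With these made-up sets, the synthetic evaluator covers 6 of 8 issues and the human evaluator 4 of 8. How the paper aggregates results across its 5 human evaluators is not stated here, so this sketch scores a single evaluator only.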

