CV, Feb 6

PlanViz: Evaluating Planning-Oriented Image Generation and Editing for Computer-Use Tasks

arXiv:2602.06663v1, 4 citations, h-index: 16
Originality: Synthesis-oriented
AI Analysis

This work addresses the underexplored potential of unified multimodal models (UMMs) for supporting computer-use planning tasks, which are relevant to everyday applications. It is incremental, however, because it focuses on benchmarking rather than advancing model capabilities.

The authors tackle the evaluation of unified multimodal models (UMMs) on planning-oriented image generation and editing for computer-use tasks. The result is PlanViz, a benchmark with three new sub-tasks and a task-adaptive score called PlanScore.

Unified multimodal models (UMMs) have shown impressive capabilities in generating natural images and supporting multimodal reasoning. However, their potential for supporting computer-use planning tasks, which are closely tied to everyday life, remains underexplored. Image generation and editing in computer-use tasks require capabilities such as spatial reasoning and procedural understanding, and it is still unknown whether UMMs have the capabilities needed to complete these tasks. Therefore, we propose PlanViz, a new benchmark designed to evaluate image generation and editing for computer-use tasks. To this end, we focus on sub-tasks that arise frequently in daily life and require planning steps. Specifically, we design three new sub-tasks: route planning, work diagramming, and web&UI displaying. We address the challenge of ensuring data quality by curating human-annotated questions and reference images and by applying a quality control process. To make evaluation comprehensive and exact, we propose a task-adaptive score, PlanScore, which helps assess the correctness, visual quality, and efficiency of generated images. Through experiments, we highlight key limitations and opportunities for future research on this topic.

