HC · May 16

Substantial, Decomposable, and Invisible: Visual Context Misalignment in Instructional Videos for Physical Tasks

Yayuan Li, Chenglin Li, Jingying Wang, Filippos Bellos, Anhong Guo, Jason J. Corso

arXiv:2605.17184 · 63.5

Predicted impact: top 16% in HC · last 90 days

Originality: Incremental advance

AI Analysis

For designers and evaluators of instructional videos, this paper shows that visual context mismatch substantially impairs task performance, that the impairment decomposes into distinct visual attributes, and that users do not notice it.

This paper investigates how visual context misalignment in instructional videos affects task performance. Compared with typical Internet videos, fully aligned videos improve completion quality by 11.1% and completion speed by 15.5%. The authors identify four visual attributes, each of which degrades performance when misaligned, yet users do not perceive the effect.

Instructional videos are the dominant medium for learning physical tasks, yet they rarely match the user's real-world visual context. Motor simulation and cognitive load theories predict this mismatch should matter, but we do not know (1) how much it could affect task completion, (2) which visual attributes are responsible, and (3) how users experience it. We conduct two complementary studies (56 participants, 86+ hours, four first-aid and culinary tasks) in which we use Wizard-of-Oz recordings to control the degree of visual alignment in instructional videos. In Study 1 (N=16), we prepare In-Context instructional videos (ICON) -- fully aligned with the user's visual perception -- to compare against business-as-usual Internet videos. ICON yields statistically significant improvements: 11.1% higher completion quality and 15.5% faster completion. Qualitative analysis reveals four visual context attributes responsible for the effect: Task Object Intrinsics, Task Object State, Environmental Context, and Observational Context. Study 2 (N=40) ablates each attribute by systematically misaligning one at a time from an otherwise fully aligned video, confirming all four produce consistent degradation. However, we find users fail to perceive the effect of single-attribute misalignment on task performance despite clear drops in objective measurement. Visual context misalignment is substantial, decomposable, and invisible to the user. These findings help understand the effect of visual context mismatch and how we should evaluate instructional videos for physical task guidance.
