AI-Assisted Competency Assessment from Egocentric Video in Simulation-Based Nursing Education
For nursing educators, this work provides a proof of concept that automated analysis of egocentric video can yield pedagogically informative signals, though the negative correlation between recognition accuracy and competency is a preliminary, domain-specific finding.
The paper investigates whether visual observations from egocentric video can provide educationally meaningful signals for competency assessment in nursing simulation. Using a three-stage framework built on a frozen DINOv2 backbone with HMM Viterbi decoding, the authors achieve 57.4% MOF in leave-one-out 1-shot action recognition across 22 annotated sessions and find a negative correlation (rho = -0.524, p = 0.012 for mIoU) between recognition accuracy and instructor-rated competency, suggesting that more competent students produce more diverse workflows that are harder to classify.
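The reported rho relates one recognition score per session to one competency rating per session. The sketch below shows how such a rank correlation could be computed, assuming rho denotes Spearman's coefficient (the text does not name it). The function names are hypothetical, and the session values in the usage line are toy numbers for illustration only, not data from the paper.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy, made-up values: per-session mIoU and instructor competency ratings.
toy_miou = [0.62, 0.55, 0.48, 0.40, 0.35]
toy_competency = [2, 3, 3, 4, 5]
print(spearman_rho(toy_miou, toy_competency))
```

A negative value indicates that sessions with higher competency ratings tend to have lower recognition scores, which is the direction of the trend described above. Computing the accompanying p-value would need a permutation test or a statistics library, which this sketch leaves out.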
Assessing learner competency in clinical simulation requires expert observation that is time-intensive, difficult to scale, and subject to inter-rater variability. Vision-language models have emerged as a promising tool for understanding complex visual behavior. In this work, we investigate whether visual observations can provide educationally meaningful signals for competency assessment through a three-stage framework that (1) extracts action timelines from egocentric nursing simulation video using frozen visual encoders and few-shot learning, (2) derives sequence-level features and per-session recognition metrics, and (3) relates these to instructor-rated competency. Across 22 densely annotated sessions (3.8 hours, 493 actions), a frozen DINOv2 backbone with HMM Viterbi decoding achieves 57.4% MOF in leave-one-out 1-shot recognition. Surprisingly, we observe a negative trend between recognition accuracy and competency (rho = -0.524, p = 0.012 for mIoU), robust to six confound controls: more competent students produce diverse, harder-to-classify workflows, while simple sequence features show no such relationship. Per-item analysis identifies patient safety protocols and team communication as the expected behaviors most reflected in this pattern, and process model comparisons reveal that higher-competency students exhibit more protocol-consistent action transitions. These findings suggest that recognition accuracy may complement predicted action timelines as a pedagogically informative signal in automated competency assessment.
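To make the recognition stage and its metrics concrete, the following is a minimal sketch of HMM Viterbi decoding over per-frame class scores, followed by two frame-level metrics. It is not the authors' implementation. It assumes that MOF means mean over frames (frame-wise accuracy) and that mIoU is the per-class intersection-over-union of frame labels averaged over the classes present; the paper may define either differently. All function names, the transition matrix, and the emission probabilities are hypothetical.

```python
import numpy as np


def viterbi(log_emit, log_trans, log_init):
    """Most likely state path given log emissions (T x K),
    log transitions (K x K, row = from, column = to), and log priors (K)."""
    T, K = log_emit.shape
    score = np.empty((T, K))
    back = np.zeros((T, K), dtype=int)
    score[0] = log_init + log_emit[0]
    for t in range(1, T):
        # cand[i, j]: best score ending in state i at t-1, then moving to j.
        cand = score[t - 1][:, None] + log_trans
        back[t] = cand.argmax(axis=0)
        score[t] = cand.max(axis=0) + log_emit[t]
    # Trace the best path backwards from the final frame.
    path = np.empty(T, dtype=int)
    path[-1] = score[-1].argmax()
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path


def mean_over_frames(pred, truth):
    """Fraction of frames whose predicted label matches the ground truth."""
    return float(np.mean(np.asarray(pred) == np.asarray(truth)))


def mean_iou(pred, truth, num_classes):
    """Frame-level IoU per class, averaged over classes that occur
    in either the prediction or the ground truth."""
    pred, truth = np.asarray(pred), np.asarray(truth)
    ious = []
    for c in range(num_classes):
        p, g = pred == c, truth == c
        union = np.logical_or(p, g).sum()
        if union > 0:
            ious.append(np.logical_and(p, g).sum() / union)
    return float(np.mean(ious))


# Toy example: two actions, five frames, one noisy frame in the middle.
emit = np.array([[0.9, 0.1], [0.9, 0.1], [0.4, 0.6], [0.9, 0.1], [0.9, 0.1]])
trans = np.array([[0.9, 0.1], [0.1, 0.9]])  # "sticky" transitions
init = np.array([0.5, 0.5])

framewise = emit.argmax(axis=1)
decoded = viterbi(np.log(emit), np.log(trans), np.log(init))
truth = np.zeros(5, dtype=int)
print(mean_over_frames(framewise, truth), mean_over_frames(decoded, truth))
```

In this toy case the sticky transition matrix discourages a one-frame switch, so the decoded path smooths over the noisy middle frame that a per-frame argmax would misclassify. In the setting described above, the emission scores would come from few-shot classification on frozen encoder features rather than from hand-written probabilities.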