SkillSight: Efficient First-Person Skill Assessment with Gaze
SkillSight addresses efficient skill assessment for users learning physical skills via smart glasses, offering a novel gaze-based approach with practical power savings.
The paper tackles automatic skill assessment from first-person data by introducing SkillSight, a two-stage framework that first jointly models gaze and egocentric video to predict skill level and then distills a gaze-only student model. The student maintains high accuracy while using 73x less power than competing methods.
Egocentric perception on smart glasses could transform how we learn new skills in the physical world, but automatic skill assessment remains a fundamental technical challenge. We introduce SkillSight for power-efficient skill assessment from first-person data. Central to our approach is the hypothesis that skill level is evident not only in how a person performs an activity (video), but also in how they direct their attention when doing so (gaze). Our two-stage framework first learns to jointly model gaze and egocentric video when predicting skill level, then distills a gaze-only student model. At inference, the student model requires only gaze input, drastically reducing power consumption by eliminating continuous video processing. Experiments on three datasets spanning cooking, music, and sports establish, for the first time, the valuable role of gaze in skill understanding across diverse real-world settings. Our SkillSight teacher model achieves state-of-the-art performance, while our gaze-only student variant maintains high accuracy using 73x less power than competing methods. These results pave the way for in-the-wild AI-supported skill learning.
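To make the teacher-to-student step concrete, here is a minimal sketch of a generic knowledge-distillation loss for a single example. It is not the SkillSight objective, which the abstract does not specify. It assumes a standard soft-target formulation: cross-entropy on the true skill label plus a temperature-scaled KL term that pulls the gaze-only student toward the gaze-plus-video teacher. The names `softmax`, `distillation_loss`, `temperature`, and `alpha` are hypothetical and chosen only for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """Generic soft-target distillation loss (an assumed form, not the paper's).

    student_logits: skill-level logits from a gaze-only student (hypothetical).
    teacher_logits: skill-level logits from a gaze+video teacher (hypothetical).
    label: index of the ground-truth skill level.
    """
    # Hard-label term: cross-entropy of the student against the true label.
    p_student = softmax(student_logits)
    cross_entropy = -math.log(p_student[label])

    # Soft-label term: KL(teacher || student) on temperature-softened outputs.
    p_teacher_soft = softmax(teacher_logits, temperature)
    p_student_soft = softmax(student_logits, temperature)
    kl = sum(pt * math.log(pt / ps)
             for pt, ps in zip(p_teacher_soft, p_student_soft) if pt > 0)

    # The temperature**2 factor keeps the soft term's gradient scale comparable.
    return alpha * cross_entropy + (1 - alpha) * temperature ** 2 * kl
```

In this sketch the soft term vanishes when the student's logits match the teacher's, so only the hard-label term remains. At inference only the student would run, which is the setting where the abstract reports its power savings.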