Yuki Sakai

CV
h-index: 11
3 papers
2 citations
Novelty: 27%
AI Score: 26

3 Papers

CV · Sep 26, 2025
EgoInstruct: An Egocentric Video Dataset of Face-to-face Instructional Interactions with Multi-modal LLM Benchmarking

Yuki Sakai, Ryosuke Furuta, Juichun Yen et al.

Analyzing instructional interactions between an instructor and a learner who are co-present in the same physical space is a critical problem for educational support and skill transfer. Yet such face-to-face instructional scenes have not been systematically studied in computer vision. We identify two key reasons: i) the lack of suitable datasets and ii) limited analytical techniques. To address this gap, we present a new egocentric video dataset of face-to-face instruction and provide ground-truth annotations for two fundamental tasks that serve as a first step toward a comprehensive understanding of instructional interactions: procedural step segmentation and conversation-state classification. Using this dataset, we benchmark multimodal large language models (MLLMs) against conventional task-specific models. Since face-to-face instruction involves multiple modalities (speech content and prosody, gaze and body motion, and visual context), effective understanding requires methods that handle verbal and nonverbal communication in an integrated manner. Accordingly, we evaluate recently introduced MLLMs that jointly process images, audio, and text. This evaluation quantifies the extent to which current machine learning models understand face-to-face instructional scenes. In experiments, MLLMs outperform specialized baselines even without task-specific fine-tuning, suggesting their promise for holistic understanding of instructional interactions.

CV · May 30, 2025
Leadership Assessment in Pediatric Intensive Care Unit Team Training

Liangyang Ouyang, Yuki Sakai, Ryosuke Furuta et al.

This paper addresses the task of assessing the leadership skills of pediatric intensive care unit (PICU) teams by developing an automated analysis framework based on egocentric vision. We identify key behavioral cues, including fixation object, eye contact, and conversation patterns, as essential indicators for leadership assessment. To capture these multimodal signals, we employ Aria glasses to record egocentric video, audio, gaze, and head movement data. We collect one-hour videos of four simulated sessions involving doctors with different roles and levels. To automate data processing, we propose a method leveraging REMoDNaV, SAM, YOLO, and ChatGPT for fixation object detection, eye contact detection, and conversation classification. In the experiments, significant correlations are observed between leadership skills and behavioral metrics, i.e., the outputs of our proposed methods, such as fixation time, transition patterns, and direct orders in speech. These results indicate that our proposed data collection and analysis framework can effectively address skill assessment in PICU team training.

NC · Jul 3, 2019
Quantitative evaluation of sense of discrepancy to operation response using event-related potential

Kazutaka Ueda, Yuki Sakai, Hideyoshi Yanagisawa

This study aimed to develop a method for quantitatively evaluating the sense of discrepancy toward an operation response. We examined whether the P300 event-related potential, which is considered to reflect attention to a stimulus, can be used to evaluate the sense of discrepancy toward a product's response to the user's action. In an experiment combining subjective evaluation and P300 measurement, we investigated the sense of discrepancy caused by the lack of an operation response (sound and vibration) to the shutter operation of a mirrorless single-lens camera, and confirmed that P300 amplitude corresponds to the degree of the subjective sense of discrepancy. These results indicate that P300 amplitude can be used to evaluate the sense of discrepancy toward an operation response.