Jiayi Xiang

CL
h-index5
3papers
4citations
Novelty32%
AI Score43

3 Papers

71.4HCJun 1
Overview of the ClinicalSkillQA 2026 Shared Task on Continuous Perception and Procedural Reasoning in Clinical Skill Assessment

Xiyang Huang, Renxiong Wei, Yihuai Xu et al.

This paper presents an overview of the ClinicalSkillQA 2026 shared task, which was organized with the BioNLP Workshop at ACL 2026. The goal of this shared task is to evaluate continuous perception and procedural reasoning in clinical skill assessment by requiring systems to reconstruct the correct temporal order of shuffled clinical key frames and generate rationales grounded in clinical workflow knowledge. The benchmark contains 200 test-only instances sampled from clinical skill videos, covering three emergency-care procedures. Each instance is annotated with the ground-truth temporal order and an expert-verified rationale. A total of seven teams participated in the task, collectively making 90 submissions, with four teams providing system description papers. Systems are evaluated using Task Accuracy, Pairwise Accuracy, and BERTScore, which measure exact sequence reconstruction, local temporal consistency, and rationale quality, respectively. In this paper, we describe the task setup, dataset construction, and evaluation criteria. We further summarize the methodologies adopted by participating teams and present a comprehensive analysis of the submitted systems. The official results suggest that current models still struggle with continuous perception and procedural reasoning, especially when they must integrate visual evidence, temporal structure, and clinical workflow knowledge.

89.9CVApr 10
SiMing-Bench: Evaluating Procedural Correctness from Continuous Interactions in Clinical Skill Videos

Xiyang Huang, Jiawei Lin, Keying Wu et al.

Current video benchmarks for multimodal large language models (MLLMs) focus on event recognition, temporal ordering, and long-context recall, but overlook a harder capability required for expert procedural judgment: tracking how ongoing interactions update the procedural state and thereby determine the correctness of later actions. We introduce SiMing-Bench, the first benchmark for evaluating this capability from full-length clinical skill videos. It targets rubric-grounded process-level judgment of whether interaction-driven state updates preserve procedural correctness across an entire workflow. SiMing-Bench is instantiated with SiMing-Score, a physician-annotated dataset of real clinical skill examination videos spanning cardiopulmonary resuscitation, automated external defibrillator operation, and bag-mask ventilation, each paired with a standardized step-wise rubric and dual-expert labels. Across diverse open- and closed-source MLLMs, we observe consistently weak agreement with physician judgments. Moreover, weak performance on rubric-defined intermediate steps persists even when overall procedure-level correlation appears acceptable, suggesting that coarse global assessment substantially overestimates current models' procedural judgment ability. Additional analyses with binary step judgment and step-aligned clips indicate that the bottleneck is not merely fine-grained scoring or temporal localization, but modeling how continuous interactions update procedural state over time.

CLOct 10, 2025Code
StatEval: A Comprehensive Benchmark for Large Language Models in Statistics

Yuchen Lu, Run Yang, Yichen Zhang et al.

Large language models (LLMs) have demonstrated remarkable advances in mathematical and logical reasoning, yet statistics, as a distinct and integrative discipline, remains underexplored in benchmarking efforts. To address this gap, we introduce \textbf{StatEval}, the first comprehensive benchmark dedicated to statistics, spanning both breadth and depth across difficulty levels. StatEval consists of 13,817 foundational problems covering undergraduate and graduate curricula, together with 2374 research-level proof tasks extracted from leading journals. To construct the benchmark, we design a scalable multi-agent pipeline with human-in-the-loop validation that automates large-scale problem extraction, rewriting, and quality control, while ensuring academic rigor. We further propose a robust evaluation framework tailored to both computational and proof-based tasks, enabling fine-grained assessment of reasoning ability. Experimental results reveal that while closed-source models such as GPT5-mini achieve below 57\% on research-level problems, with open-source models performing significantly lower. These findings highlight the unique challenges of statistical reasoning and the limitations of current LLMs. We expect StatEval to serve as a rigorous benchmark for advancing statistical intelligence in large language models. All data and code are available on our web platform: https://stateval.github.io/.