AI · Aug 6, 2025

SID: Benchmarking Guided Instruction Capabilities in STEM Education with a Socratic Interdisciplinary Dialogues Dataset

Mei Jiang, Houping Yue, Bingdong Li, Hao Hao, Ying Qian, Bo Jiang, Aimin Zhou

arXiv:2508.04563v1 · 1 citation · h-index: 11

Originality: Synthesis-oriented

AI Analysis

This paper provides a new benchmark for assessing LLMs in educational guidance and addresses the difficulty of scaling expert guidance in STEM teaching, though it is incremental in that it focuses on evaluation rather than novel methods.

The authors tackled the problem of evaluating LLMs' guided instruction capabilities in interdisciplinary STEM education by introducing SID, a benchmark with 10,000 dialogue turns across 48 projects, and found that even state-of-the-art LLMs struggle to achieve effective knowledge integration and transfer.

Fostering students' abilities for knowledge integration and transfer in complex problem-solving scenarios is a core objective of modern education, and interdisciplinary STEM is a key pathway to achieve this, yet it requires expert guidance that is difficult to scale. While LLMs offer potential in this regard, their true capability for guided instruction remains unclear due to the lack of an effective evaluation benchmark. To address this, we introduce SID, the first benchmark designed to systematically evaluate the higher-order guidance capabilities of LLMs in multi-turn, interdisciplinary Socratic dialogues. Our contributions include a large-scale dataset of 10,000 dialogue turns across 48 complex STEM projects, a novel annotation schema for capturing deep pedagogical features, and a new suite of evaluation metrics (e.g., X-SRG). Baseline experiments confirm that even state-of-the-art LLMs struggle to execute effective guided dialogues that lead students to achieve knowledge integration and transfer. This highlights the critical value of our benchmark in driving the development of more pedagogically-aware LLMs.
