Reflecting in the Reflection: Integrating a Socratic Questioning Framework into Automated AI-Based Question Generation

Ondřej Holub, Essi Ryymin, Rodrigo Alves

arXiv:2601.14798v1

h-index: 4

Originality: Incremental advance

AI Analysis

The paper addresses the time-consuming task teachers face in designing good reflection questions. The advance is incremental: it builds on existing LLM methods, adding a novel multi-agent approach.

This paper tackles the problem of automating the generation of reflection questions for education by introducing a reflection-in-reflection framework that uses two role-specialized LLM agents in a Socratic dialogue, resulting in questions judged substantially more relevant and deeper than a one-shot baseline.

Designing good reflection questions is pedagogically important but time-consuming and unevenly supported across teachers. This paper introduces a reflection-in-reflection framework for automated generation of reflection questions with large language models (LLMs). Our approach coordinates two role-specialized agents, a Student-Teacher and a Teacher-Educator, that engage in a Socratic multi-turn dialogue to iteratively refine a single question given a teacher-specified topic, key concepts, student level, and optional instructional materials. The Student-Teacher proposes candidate questions with brief rationales, while the Teacher-Educator evaluates them along clarity, depth, relevance, engagement, and conceptual interconnections, responding only with targeted coaching questions or a fixed signal to stop the dialogue. We evaluate the framework in an authentic lower-secondary ICT setting on the topic, using GPT-4o-mini as the backbone model and a stronger GPT-4-class LLM as an external evaluator in pairwise comparisons of clarity, relevance, depth, and overall quality. First, we study how interaction design and context (dynamic vs. fixed iteration counts; presence or absence of student level and materials) affect question quality. Dynamic stopping combined with contextual information consistently outperforms fixed 5- or 10-step refinement, with very long dialogues prone to drift or over-complication. Second, we show that our two-agent protocol produces questions that are judged substantially more relevant and deeper, and better overall, than a one-shot baseline using the same backbone model.
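The two-agent loop with dynamic stopping described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `Context`, `refine_question`, and `STOP_SIGNAL`, the literal stop string, the `max_turns` cap, and the example topic are all assumptions, and the two agents are deterministic stubs standing in for LLM calls.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Hypothetical stop token; the abstract only mentions "a fixed signal".
STOP_SIGNAL = "STOP"

@dataclass
class Context:
    """Teacher-specified inputs named in the abstract."""
    topic: str
    key_concepts: List[str]
    student_level: Optional[str] = None   # optional context
    materials: Optional[str] = None       # optional instructional materials

# Dialogue history as (role, message) pairs.
History = List[Tuple[str, str]]
# An agent maps the context and the dialogue so far to its next message.
Agent = Callable[[Context, History], str]

def refine_question(ctx: Context, student_teacher: Agent,
                    teacher_educator: Agent,
                    max_turns: int = 10) -> Tuple[str, History]:
    """Alternate the two agents until the educator stops or the cap is hit."""
    history: History = []
    candidate = ""
    for _ in range(max_turns):
        # Student-Teacher proposes (or revises) a candidate question.
        candidate = student_teacher(ctx, history)
        history.append(("student_teacher", candidate))
        # Teacher-Educator replies with a coaching question or the stop signal.
        feedback = teacher_educator(ctx, history)
        history.append(("teacher_educator", feedback))
        if feedback.strip() == STOP_SIGNAL:
            break  # dynamic stopping
    return candidate, history

# Stub agents (assumptions): replace with real LLM calls in practice.
def stub_student_teacher(ctx: Context, history: History) -> str:
    n = sum(1 for role, _ in history if role == "student_teacher") + 1
    return f"Draft {n}: How does {ctx.key_concepts[0]} relate to {ctx.topic}?"

def stub_teacher_educator(ctx: Context, history: History) -> str:
    drafts = sum(1 for role, _ in history if role == "student_teacher")
    return STOP_SIGNAL if drafts >= 3 else "What would make this question probe deeper?"

# Placeholder topic and concepts, chosen only for illustration.
example_ctx = Context(topic="online safety", key_concepts=["passwords"],
                      student_level="lower-secondary")
final_question, dialogue = refine_question(
    example_ctx, stub_student_teacher, stub_teacher_educator)
```

Setting `max_turns` to a fixed value and having the educator never emit the stop signal would correspond to the fixed 5- or 10-step condition; the cap also guards against the drift the abstract reports for very long dialogues.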
