CL AI | Aug 27, 2025

Prompting Strategies for Language Model-Based Item Generation in K-12 Education: Bridging the Gap Between Small and Large Language Models

Mohammad Amini, Babak Ahmadi, Xiaomeng Xiong, Yilin Zhang, Christopher Qiao

arXiv:2508.20217v1 | 2.7

Originality: Incremental advance

AI Analysis

This work addresses the cost and inconsistency of manual test development for K-12 education. Its contribution is incremental, as it builds on existing prompting techniques.

This study tackled the automatic generation of multiple-choice questions for K-12 morphological assessment by comparing a fine-tuned medium-sized model with a larger untuned one and by evaluating seven structured prompting strategies. Results showed that structured prompting, especially strategies combining chain-of-thought and sequential design, significantly improved the medium-sized model's outputs, which were generally more construct-aligned and instructionally appropriate than the larger model's zero-shot responses.

This study explores automatic item generation (AIG) using language models to create multiple-choice questions (MCQs) for morphological assessment, aiming to reduce the cost and inconsistency of manual test development. The study used a two-fold approach. First, we compared a fine-tuned medium model (Gemma, 2B) with a larger untuned one (GPT-3.5, 175B). Second, we evaluated seven structured prompting strategies, including zero-shot, few-shot, chain-of-thought, role-based, sequential, and combinations. Generated items were assessed using automated metrics and expert scoring across five dimensions. We also used GPT-4.1, trained on expert-rated samples, to simulate human scoring at scale. Results show that structured prompting, especially strategies combining chain-of-thought and sequential design, significantly improved Gemma's outputs. Gemma generally produced more construct-aligned and instructionally appropriate items than GPT-3.5's zero-shot responses, with prompt design playing a key role in mid-size model performance. This study demonstrates that structured prompting and efficient fine-tuning can enhance mid-size models for AIG under limited data conditions. We highlight the value of combining automated metrics, expert judgment, and large-model simulation to ensure alignment with assessment goals. The proposed workflow offers a practical and scalable way to develop and validate language assessment items for K-12.
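As a rough illustration of how the prompting strategies named in the abstract differ, the sketch below composes a prompt from optional building blocks: a role, worked examples, ordered steps, and a chain-of-thought instruction. It is not the authors' prompt set. The `PromptSpec` class, the `build_prompt` function, the task wording, and every example string are hypothetical, and only an illustrative subset of strategies is shown.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PromptSpec:
    """Hypothetical description of which prompt components to include."""
    role: Optional[str] = None                          # role-based prompting
    examples: List[str] = field(default_factory=list)   # few-shot prompting
    steps: List[str] = field(default_factory=list)      # sequential prompting
    chain_of_thought: bool = False                      # chain-of-thought prompting


def build_prompt(task: str, spec: PromptSpec) -> str:
    """Assemble a prompt string from the task and the selected components."""
    parts: List[str] = []
    if spec.role:
        parts.append(f"You are {spec.role}.")
    parts.append(task)
    if spec.examples:
        parts.append("Examples:")
        parts.extend(f"- {example}" for example in spec.examples)
    if spec.steps:
        parts.append("Complete these steps in order:")
        parts.extend(f"{i}. {step}" for i, step in enumerate(spec.steps, 1))
    if spec.chain_of_thought:
        parts.append("Explain your reasoning step by step before giving the final item.")
    return "\n".join(parts)


# Assumed task and step wording, for illustration only.
TASK = "Write one multiple-choice question that assesses understanding of the suffix '-ness'."
STEPS = [
    "Choose a target word containing the suffix.",
    "Write the question stem.",
    "Write one correct option and three distractors.",
]

STRATEGIES = {
    "zero_shot": PromptSpec(),
    "few_shot": PromptSpec(examples=["Which word means 'the state of being kind'? (kindness)"]),
    "chain_of_thought": PromptSpec(chain_of_thought=True),
    "role_based": PromptSpec(role="an experienced K-12 assessment writer"),
    "sequential": PromptSpec(steps=STEPS),
    "cot_sequential": PromptSpec(chain_of_thought=True, steps=STEPS),
}

PROMPTS = {name: build_prompt(TASK, spec) for name, spec in STRATEGIES.items()}
```

With this layout the zero-shot prompt is the bare task, and a combined strategy such as `cot_sequential` is the same task with more components switched on, so each strategy's effect can be compared against a shared baseline.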
