CL · HC · May 10, 2025

Evaluating LLM-Generated Q&A Test: a Student-Centered Study

arXiv:2505.06591v1 · 3 citations · h-index: 2 · AIED
Originality: Synthesis-oriented
AI Analysis

This work offers a scalable approach to AI-assisted assessment development for educational institutions, though the contribution is incremental: it applies existing LLM methods to a new domain.

The researchers tackled the problem of creating reliable AI-generated educational assessments by developing an automatic pipeline using GPT-4o-mini to produce Q&A tests for a Natural Language Processing course. The results showed that the generated items exhibited strong discrimination and appropriate difficulty in IRT analysis, with high student and expert ratings, demonstrating they can match human-authored tests in psychometric performance and user satisfaction.

This research prepares an automatic pipeline for generating reliable question-answer (Q&A) tests using AI chatbots. We automatically generated a GPT-4o-mini-based Q&A test for a Natural Language Processing course and evaluated its psychometric and perceived-quality metrics with students and experts. A mixed-format IRT analysis showed that the generated items exhibit strong discrimination and appropriate difficulty, while student and expert star ratings reflect high overall quality. A uniform DIF check identified two items for review. These findings demonstrate that LLM-generated assessments can match human-authored tests in psychometric performance and user satisfaction, illustrating a scalable approach to AI-assisted assessment development.
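To make the terms "discrimination" and "difficulty" concrete, here is a minimal sketch of a two-parameter logistic (2PL) item response function. The abstract only says that a mixed-format IRT analysis was used, so the choice of the 2PL model, the function names, and the item parameters below are all assumptions made for illustration. They are not the authors' model or data.

```python
import math


def p_correct(theta: float, a: float, b: float) -> float:
    """2PL item response function (assumed model, not taken from the paper).

    theta: examinee ability
    a:     item discrimination (slope of the curve at theta == b)
    b:     item difficulty (ability at which P(correct) is 0.5)
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def item_information(theta: float, a: float, b: float) -> float:
    """Fisher information of a 2PL item at ability theta: a^2 * P * (1 - P)."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)


# Hypothetical items with the same difficulty but different discrimination.
ITEMS = {
    "high_discrimination": (2.0, 0.0),
    "low_discrimination": (0.5, 0.0),
}

if __name__ == "__main__":
    for name, (a, b) in ITEMS.items():
        weaker = p_correct(-1.0, a, b)    # examinee one unit below difficulty
        stronger = p_correct(1.0, a, b)   # examinee one unit above difficulty
        print(f"{name}: P(theta=-1)={weaker:.3f}, P(theta=+1)={stronger:.3f}")
```

Under this assumed model, an item with larger `a` separates weaker and stronger examinees more sharply around its difficulty `b`, which is the sense in which an item "exhibits strong discrimination".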

