Evaluating LLM-Generated Q&A Test: a Student-Centered Study
The work provides a scalable approach to AI-assisted assessment development for educational institutions, though its contribution is incremental: it applies existing LLM methods to a new domain.
The researchers tackled the problem of creating reliable AI-generated educational assessments by developing an automatic pipeline using GPT-4o-mini to produce Q&A tests for a Natural Language Processing course. The results showed that the generated items exhibited strong discrimination and appropriate difficulty in IRT analysis, with high student and expert ratings, demonstrating they can match human-authored tests in psychometric performance and user satisfaction.
This research presents an automatic pipeline for generating reliable question-answer (Q&A) tests using AI chatbots. We automatically generated a GPT-4o-mini-based Q&A test for a Natural Language Processing course and evaluated its psychometric and perceived-quality metrics with students and experts. A mixed-format item response theory (IRT) analysis showed that the generated items exhibit strong discrimination and appropriate difficulty, while student and expert star ratings reflect high overall quality. A uniform differential item functioning (DIF) check identified two items for review. These findings demonstrate that LLM-generated assessments can match human-authored tests in psychometric performance and user satisfaction, illustrating a scalable approach to AI-assisted assessment development.
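To make the terms "discrimination" and "difficulty" concrete, here is a minimal sketch of the two-parameter logistic (2PL) item response function, a standard IRT model for dichotomously scored items. The abstract does not specify which models the mixed-format analysis used, so the 2PL form, the function names, and the parameter values below are illustrative assumptions, not the authors' implementation.

```python
import math


def p_correct_2pl(theta: float, a: float, b: float) -> float:
    """Probability of a correct response under the 2PL model.

    theta: examinee ability
    a:     item discrimination (slope of the curve at theta == b)
    b:     item difficulty (ability at which P(correct) = 0.5)

    Assumption: the 2PL form is used purely for illustration; the
    source does not state which IRT model was fitted.
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def item_information(theta: float, a: float, b: float) -> float:
    """Fisher information of a 2PL item at ability theta: a^2 * P * (1 - P)."""
    p = p_correct_2pl(theta, a, b)
    return a * a * p * (1.0 - p)


if __name__ == "__main__":
    # Hypothetical item parameters, chosen only to show the behaviour.
    a, b = 1.5, 0.0
    for theta in (-2.0, -1.0, 0.0, 1.0, 2.0):
        print(theta, round(p_correct_2pl(theta, a, b), 3),
              round(item_information(theta, a, b), 3))
```

In this parameterization, an item with larger `a` separates examinees just below and just above `b` more sharply, and an item is most informative for examinees whose ability is close to its difficulty. "Appropriate difficulty" would then mean that the `b` values fall within the ability range of the student group.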