CL, Apr 11, 2024

Automatic Generation and Evaluation of Reading Comprehension Test Items with Large Language Models

arXiv:2404.07720v2, 80 citations, h-index: 6, READI
AI Analysis

This work addresses the time-consuming task of test creation for educators and researchers, particularly in languages with limited data. The contribution is incremental, since it applies existing LLMs to a specific domain.

The paper tackles the manual effort of creating reading comprehension tests by using large language models (LLMs) to generate and evaluate multiple-choice items. It finds that GPT-4 outperformed Llama 2 in generating items of acceptable quality and that GPT-4's evaluation results were the most similar to those of human annotators.

Reading comprehension tests are used in a variety of applications, ranging from education to assessing the comprehensibility of simplified texts. However, creating such tests manually and ensuring their quality is difficult and time-consuming. In this paper, we explore how large language models (LLMs) can be used to generate and evaluate multiple-choice reading comprehension items. To this end, we compiled a dataset of German reading comprehension items and developed a new protocol for human and automatic evaluation, including a metric we call text informativity, which is based on guessability and answerability. We then used this protocol and the dataset to evaluate the quality of items generated by Llama 2 and GPT-4. Our results suggest that both models are capable of generating items of acceptable quality in a zero-shot setting, but GPT-4 clearly outperforms Llama 2. We also show that LLMs can be used for automatic evaluation by eliciting item responses from them. In this scenario, evaluation results with GPT-4 were the most similar to human annotators. Overall, zero-shot generation with LLMs is a promising approach for generating and evaluating reading comprehension test items, in particular for languages without large amounts of available data.
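
The abstract says only that text informativity is "based on guessability and answerability" and gives no formula. The sketch below shows one plausible reading and should not be taken as the authors' definition. It assumes that guessability is the share of items answered correctly without seeing the text, that answerability is the share answered correctly with the text, and that text informativity is the difference between the two. The `ItemResponse` class and all function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ItemResponse:
    """One respondent's result on one multiple-choice item (hypothetical structure)."""
    correct_without_text: bool  # answered correctly without seeing the passage
    correct_with_text: bool     # answered correctly after reading the passage


def _share(flags: Sequence[bool]) -> float:
    """Fraction of True values; rejects empty input to avoid dividing by zero."""
    if not flags:
        raise ValueError("at least one response is required")
    return sum(flags) / len(flags)


def guessability(responses: Sequence[ItemResponse]) -> float:
    # Assumed definition: share of correct answers given WITHOUT the text.
    return _share([r.correct_without_text for r in responses])


def answerability(responses: Sequence[ItemResponse]) -> float:
    # Assumed definition: share of correct answers given WITH the text.
    return _share([r.correct_with_text for r in responses])


def text_informativity(responses: Sequence[ItemResponse]) -> float:
    # Assumed combination: how much reading the text improves accuracy.
    return answerability(responses) - guessability(responses)


# Made-up illustrative data, not results from the paper.
responses = [
    ItemResponse(correct_without_text=True, correct_with_text=True),
    ItemResponse(correct_without_text=False, correct_with_text=True),
    ItemResponse(correct_without_text=False, correct_with_text=True),
    ItemResponse(correct_without_text=False, correct_with_text=False),
]
score = text_informativity(responses)
```

Under this reading, an item set that can be guessed without the text, or that cannot be answered even with it, scores low, while a set that is answerable only after reading scores high. The same functions would apply whether the responses come from human annotators or are elicited from an LLM.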

Code Implementations: 2 repos
