CL · AI · Oct 18, 2024

From Test-Taking to Test-Making: Examining LLM Authoring of Commonsense Assessment Items

arXiv:2410.14897v123 citations · h-index: 12 · EMNLP
Originality: Synthesis-oriented
AI Analysis

This work addresses automated authoring of test items for commonsense reasoning, which could benefit educators and researchers. It is incremental, however, in that it builds on existing benchmarks and methods.

The study investigated whether large language models (LLMs) can generate commonsense assessment items, specifically in the style of the COPA benchmark, and found that LLMs that perform well on answering COPA questions are also more effective at creating such items.

LLMs can now perform a variety of complex writing tasks. They also excel in answering questions pertaining to natural language inference and commonsense reasoning. Composing these questions is itself a skilled writing task, so in this paper we consider LLMs as authors of commonsense assessment items. We prompt LLMs to generate items in the style of a prominent benchmark for commonsense reasoning, the Choice of Plausible Alternatives (COPA). We examine the outcome according to analyses facilitated by the LLMs and human annotation. We find that LLMs that succeed in answering the original COPA benchmark are also more successful in authoring their own items.
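To make the authoring task concrete, the sketch below models a COPA-style item (a premise, a cause-or-effect question, two alternatives, and the index of the more plausible one) and assembles a few-shot authoring prompt from example items. The names `CopaItem`, `format_item`, and `build_authoring_prompt`, as well as the prompt wording, are assumptions made for illustration; they are not the prompts or code used in the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CopaItem:
    """Hypothetical container for one COPA-style item."""
    premise: str
    question: str        # "cause" or "effect"
    alternative_1: str
    alternative_2: str
    answer: int          # 1 or 2: index of the more plausible alternative

    def __post_init__(self) -> None:
        # Reject items that do not fit the two-alternative COPA format.
        if self.question not in ("cause", "effect"):
            raise ValueError("question must be 'cause' or 'effect'")
        if self.answer not in (1, 2):
            raise ValueError("answer must be 1 or 2")


def format_item(item: CopaItem) -> str:
    """Render an item as plain text, without revealing the answer."""
    ask = ("What was the CAUSE of this?" if item.question == "cause"
           else "What happened as a RESULT?")
    return "\n".join([
        f"Premise: {item.premise}",
        ask,
        f"Alternative 1: {item.alternative_1}",
        f"Alternative 2: {item.alternative_2}",
    ])


def build_authoring_prompt(examples: List[CopaItem], question: str) -> str:
    """Assemble a few-shot prompt asking a model to author one new item.

    The instruction wording here is an assumption, not the paper's prompt.
    """
    shots = "\n\n".join(
        f"{format_item(ex)}\nMore plausible: Alternative {ex.answer}"
        for ex in examples
    )
    return (
        "Write a new commonsense reasoning item in the same format as the "
        "examples. Exactly one alternative should be clearly more plausible.\n\n"
        f"{shots}\n\n"
        f"Now write one new item whose question asks for the {question}."
    )


# Illustrative item in the COPA format.
example = CopaItem(
    premise="The man broke his toe.",
    question="cause",
    alternative_1="He got a hole in his sock.",
    alternative_2="He dropped a hammer on his foot.",
    answer=2,
)
prompt = build_authoring_prompt([example], question="effect")
```

The resulting `prompt` string would be sent to whichever model is being evaluated as an author; parsing the model's reply back into a `CopaItem` gives a simple format check before any human or model-based quality analysis.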

