CL, AI | Jan 28

Automated Benchmark Generation from Domain Guidelines Informed by Bloom's Taxonomy

Si Chen, Le Huy Khiem, Annalisa Szymanski, Ronald Metoyer, Ting Hua, Nitesh V. Chawla

arXiv:2601.20253v11.11 citations | h-index: 8

Originality: Incremental advance

AI Analysis

The paper addresses a problem relevant to AI researchers: evaluating contextualized reasoning in real-world domains. The advance is incremental, since it builds on existing benchmark generation methods and adds a specific taxonomic approach.

The paper tackles the challenge of evaluating open-ended question answering in practice-based domains. It introduces an automated framework that generates benchmarks from expert guidelines using Bloom's Taxonomy and applies it to teaching, dietetics, and caregiving. The results show that LLMs sometimes perform better on higher-order reasoning (Analyze) but fail more often on lower-level items (Remember). The resulting large-scale benchmarks reveal these non-intuitive model behaviors.

Open-ended question answering (QA) evaluates a model's ability to perform contextualized reasoning beyond factual recall. This challenge is especially acute in practice-based domains, where knowledge is procedural and grounded in professional judgment, while most existing LLM benchmarks depend on pre-existing human exam datasets that are often unavailable in such settings. We introduce a framework for automated benchmark generation from expert-authored guidelines informed by Bloom's Taxonomy. It converts expert practices into implicit violation-based scenarios and expands them into auto-graded multiple-choice questions (MCQs) and multi-turn dialogues across four cognitive levels, enabling deterministic, reproducible, and scalable evaluation. In three applied domains (teaching, dietetics, and caregiving), we find differences between model and human-like reasoning: LLMs sometimes perform relatively better on higher-order reasoning (Analyze) but fail more frequently on lower-level items (Remember). We produce large-scale, psychometrically informed benchmarks that surface these non-intuitive model behaviors and enable evaluation of contextualized reasoning in real-world settings.
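To make the idea of deterministic, auto-graded MCQs grouped by cognitive level concrete, here is a minimal Python sketch. It is not the authors' implementation: the `MCQItem` schema, the `accuracy_by_level` helper, and the toy items are all hypothetical, and only the two level names mentioned in the abstract (Remember, Analyze) are used.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class MCQItem:
    """One auto-graded item (hypothetical schema, not the paper's)."""
    item_id: str
    level: str                # Bloom level label, e.g. "Remember" or "Analyze"
    options: tuple[str, ...]  # answer choices shown to the model
    answer_index: int         # index of the keyed (correct) option


def accuracy_by_level(items, predictions):
    """Return {level: fraction correct}; unanswered items count as wrong.

    Grading is an exact match against the answer key, so the same
    predictions always produce the same scores.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        total[item.level] += 1
        if predictions.get(item.item_id) == item.answer_index:
            correct[item.level] += 1
    return {level: correct[level] / total[level] for level in total}


# Toy data, invented purely for illustration.
items = [
    MCQItem("q1", "Remember", ("A", "B", "C", "D"), 2),
    MCQItem("q2", "Remember", ("A", "B", "C", "D"), 0),
    MCQItem("q3", "Analyze", ("A", "B", "C", "D"), 1),
    MCQItem("q4", "Analyze", ("A", "B", "C", "D"), 3),
]
predictions = {"q1": 2, "q2": 1, "q3": 1, "q4": 3}

scores = accuracy_by_level(items, predictions)
print(scores)
```

Reporting accuracy per level rather than as one aggregate is what would let a pattern like the one described in the abstract (stronger on Analyze than on Remember) show up at all.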

