CL · May 30, 2025

DEEPQUESTION: Systematic Generation of Real-World Challenges for Evaluating LLMs Performance

arXiv:2505.24532v1 · h-index: 27 · Has Code
Originality: Incremental advance
AI Analysis

This paper addresses the need for better benchmarks to evaluate LLMs in real-world scenarios, though the advance is incremental because it builds on existing datasets and on Bloom's taxonomy.

The authors address the gap between benchmark and real-world LLM performance by introducing DeepQuestion, a framework that generates challenging questions based on Bloom's taxonomy. On higher-order tasks, model accuracy drops by up to 70%.

LLMs often excel on standard benchmarks but falter on real-world tasks. We introduce DeepQuestion, a scalable automated framework that augments existing datasets based on Bloom's taxonomy and creates novel questions that trace original solution paths to probe evaluative and creative skills. Extensive experiments across ten open-source and proprietary models, covering both general-purpose and reasoning LLMs, reveal substantial performance drops (even up to 70% accuracy loss) on higher-order tasks, underscoring persistent gaps in deep reasoning. Our work highlights the need for cognitively diverse benchmarks to advance LLM progress. DeepQuestion and related datasets will be released upon acceptance of the paper.
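The reported accuracy loss on higher-order tasks can be illustrated with a small calculation. The sketch below is not the authors' code: the function names, the toy records, and the 0.75 baseline are hypothetical, and it assumes the drop is measured as an absolute difference from accuracy on the original questions, which the abstract does not specify. The level names follow the revised Bloom's taxonomy.

```python
from collections import defaultdict

# Levels of the revised Bloom's taxonomy, lowest to highest order.
BLOOM_LEVELS = ["remember", "understand", "apply", "analyze", "evaluate", "create"]


def accuracy_by_level(records):
    """Compute accuracy per Bloom level from (level, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for level, is_correct in records:
        totals[level] += 1
        correct[level] += int(is_correct)
    # Report only levels that have at least one question, in taxonomy order.
    return {lvl: correct[lvl] / totals[lvl] for lvl in BLOOM_LEVELS if totals[lvl]}


def accuracy_drop(baseline_acc, level_acc):
    """Absolute accuracy drop of each level relative to a baseline (assumption)."""
    return {lvl: baseline_acc - acc for lvl, acc in level_acc.items()}


# Hypothetical graded answers for one model; not data from the paper.
records = [
    ("apply", True), ("apply", True), ("apply", False), ("apply", True),
    ("evaluate", True), ("evaluate", False), ("evaluate", False), ("evaluate", False),
    ("create", False), ("create", False), ("create", True), ("create", False),
]

acc = accuracy_by_level(records)
drops = accuracy_drop(0.75, acc)  # 0.75 is an assumed accuracy on the original questions

for level, drop in drops.items():
    print(f"{level:>10}: accuracy {acc[level]:.2f}, drop {drop:.2f}")
```

Grouping by level rather than reporting a single aggregate score is what makes a gap on evaluative and creative questions visible.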

