CL · IR · May 12

Question Difficulty Estimation for Large Language Models via Answer Plausibility Scoring

Jamshid Mozafari, Bhawna Piryani, Adam Jatowt

arXiv:2605.12398 · 79.4

Predicted impact: top 71% in CL · last 90 days · Originality: Incremental advance

AI Analysis

For researchers and developers of QA systems, this provides an interpretable and scalable method to estimate question difficulty without relying on external resources.

The paper introduces Q-DAPS, a method that estimates question difficulty for LLMs by computing the entropy of answer plausibility scores. It outperforms baselines on four QA datasets and aligns well with human judgments.

Estimating question difficulty is a critical component in evaluating and improving large language models (LLMs) for question answering (QA). Existing approaches often rely on readability formulas, retrieval-based signals, or popularity statistics, which may not fully capture the reasoning challenges posed to modern LLMs. In this paper, we introduce Q-DAPS (Question Difficulty based on Answer Plausibility Scores), a novel method that estimates question difficulty by computing the entropy of plausibility scores over candidate answers. We systematically evaluate Q-DAPS on four prominent QA datasets (TriviaQA, NQ, MuSiQue, and QASC) and show that it consistently outperforms baselines. Moreover, Q-DAPS is robust to hyperparameter variations and question types. Extensive ablation studies show that it remains robust across different plausibility estimation paradigms, model sizes, and realistic settings. Human evaluations confirm strong alignment between Q-DAPS's difficulty estimates and human judgments of question difficulty. Overall, Q-DAPS provides an interpretable, scalable, and bias-resilient approach to question difficulty estimation in modern QA systems.
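
The core idea in the abstract, entropy computed over plausibility scores for candidate answers, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how candidates are generated, how scores are obtained, or how they are normalized. The function name `plausibility_entropy`, the sum-to-one normalization, the division by log(n), and the reading that higher entropy means a harder question are all assumptions made for illustration.

```python
import math
from typing import Sequence


def plausibility_entropy(scores: Sequence[float], normalize: bool = True) -> float:
    """Shannon entropy of candidate-answer plausibility scores.

    Hypothetical helper: scores are assumed to be non-negative and are
    rescaled to sum to 1 before the entropy is computed.
    """
    if not scores:
        raise ValueError("need at least one candidate score")
    if any(s < 0 for s in scores):
        raise ValueError("scores must be non-negative")
    total = sum(scores)
    if total == 0:
        raise ValueError("at least one score must be positive")

    # Treat the rescaled scores as a probability distribution over candidates.
    probs = [s / total for s in scores]

    # Shannon entropy in nats; zero-probability candidates contribute nothing.
    entropy = sum(-p * math.log(p) for p in probs if p > 0)

    # Assumption: divide by log(n) so results are comparable across
    # questions with different numbers of candidates (range 0 to 1).
    if normalize and len(scores) > 1:
        entropy /= math.log(len(scores))
    return entropy


# One clearly dominant candidate versus several similarly plausible ones.
peaked = plausibility_entropy([0.9, 0.05, 0.05])
flat = plausibility_entropy([0.4, 0.3, 0.3])
print(peaked < flat)
```

Under these assumptions, a question whose candidates receive near-equal plausibility yields entropy close to 1, while a question with a single dominant candidate yields entropy close to 0.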
