42.9 | AI | Apr 14
Beyond Scores: Diagnostic LLM Evaluation via Fine-Grained Abilities
Xu Zhang, Xudong Gong, Jiacheng Qin et al.
Current evaluations of large language models aggregate performance across diverse tasks into single scores. This obscures fine-grained variation in ability, which limits targeted model improvement and ability-guided model selection for specific tasks. Motivated by this gap, we propose a cognitive diagnostic framework that estimates model abilities across multiple fine-grained dimensions. For mathematics, we construct a 35-dimensional ability taxonomy grounded in cognitive theory and domain knowledge. The framework employs multidimensional Item Response Theory with an item-ability association matrix to estimate fine-grained ability levels, which in turn enable prediction of performance on unseen items (benchmark questions). Evaluated on 41 models, our approach demonstrates strong criterion validity, consistent ability estimates across benchmarks, and accurate prediction on unseen items, with AUC ranging from 0.80 to 0.89 within benchmarks and from 0.77 to 0.86 across benchmarks, substantially exceeding trivial baselines. The framework generalizes across scientific domains, producing consistent diagnostic performance in physics (27 dimensions), chemistry (58 dimensions), and computer science (12 dimensions). This work establishes a principled framework for fine-grained assessment of abilities, with potential applications in targeted training, ability-guided model selection, and ability-aware benchmark design.
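To make the role of the item-ability association matrix concrete, the following is a minimal sketch of how a multidimensional IRT model could turn ability estimates into a prediction for one item. The abstract does not state the exact model form, so the compensatory logistic form, the function name `predict_correct`, and all numeric values here are assumptions chosen for illustration.

```python
import math

def predict_correct(theta, q_row, discrimination, difficulty):
    """Probability that a model answers one item correctly.

    theta          -- ability estimate per dimension (hypothetical values)
    q_row          -- the item's row of the item-ability association matrix:
                      1 if the item requires that ability, else 0
    discrimination -- per-dimension item discrimination (assumed parameter)
    difficulty     -- scalar item difficulty (assumed parameter)
    """
    # Only abilities that the association matrix marks as relevant contribute.
    logit = sum(q * a * t for q, a, t in zip(q_row, discrimination, theta))
    logit -= difficulty
    return 1.0 / (1.0 + math.exp(-logit))

# Toy example with 3 ability dimensions; the item uses dimensions 0 and 2.
theta = [1.2, -0.5, 0.3]
q_row = [1, 0, 1]
discrimination = [1.0, 1.0, 0.8]
difficulty = 0.5

p = predict_correct(theta, q_row, discrimination, difficulty)
```

Because the second entry of `q_row` is 0, changing the second ability leaves the prediction untouched, which is what lets a fitted model attribute successes and failures to specific dimensions.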
LG | Jan 11, 2024
Optimistic Model Rollouts for Pessimistic Offline Policy Optimization
Yuanzhao Zhai, Yiying Li, Zijian Gao et al.
Model-based offline reinforcement learning (RL) has made remarkable progress, offering a promising avenue for improving generalization with synthetic model rollouts. Existing works primarily focus on incorporating pessimism into policy optimization, usually by constructing a Pessimistic Markov Decision Process (P-MDP). However, the P-MDP discourages policies from learning in out-of-distribution (OOD) regions beyond the support of the offline dataset, which can under-utilize the generalization ability of dynamics models. In contrast, we propose constructing an Optimistic MDP (O-MDP). We first observe that optimism can be beneficial because it encourages more OOD rollouts. Motivated by this observation, we present ORPO, a simple yet effective model-based offline RL framework. ORPO generates Optimistic model Rollouts for Pessimistic offline policy Optimization. Specifically, we train an optimistic rollout policy in the O-MDP to sample more OOD model rollouts. We then relabel the sampled state-action pairs with penalized rewards and optimize the output policy in the P-MDP. Theoretically, we show that the performance of policies trained with ORPO can be lower-bounded in linear MDPs. Experimental results show that our framework significantly outperforms P-MDP baselines by a margin of 30%, achieving state-of-the-art performance on a widely used benchmark. Moreover, ORPO exhibits notable advantages in problems that require generalization.
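The two reward views described above can be sketched as follows. The abstract does not give the form of the bonus or the penalty, so this sketch assumes both are proportional to an uncertainty estimate taken as the disagreement (population standard deviation) of an ensemble of dynamics-model predictions; the function names and the coefficients `beta` and `lam` are hypothetical.

```python
import statistics

def uncertainty(ensemble_predictions):
    """Assumed uncertainty measure: spread of ensemble predictions for one (s, a)."""
    return statistics.pstdev(ensemble_predictions)

def optimistic_reward(reward, u, beta=1.0):
    """O-MDP view (assumption): add a bonus so the rollout policy seeks OOD regions."""
    return reward + beta * u

def pessimistic_reward(reward, u, lam=1.0):
    """P-MDP view (assumption): relabel with a penalty before training the output policy."""
    return reward - lam * u

# One sampled state-action pair: model reward 1.0, ensemble members disagree.
u = uncertainty([0.0, 2.0])                       # spread of the ensemble
rollout_r = optimistic_reward(1.0, u, beta=0.5)   # used while collecting rollouts
train_r = pessimistic_reward(1.0, u, lam=2.0)     # used for policy optimization
```

The same transition therefore looks attractive to the rollout policy and is treated conservatively when the output policy is optimized, which is the split the abstract describes.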
CR | Jul 11, 2013
A Secure Distributed Authentication scheme based on CRT-VSS and Trusted Computing in MANET
Qiwei Lu, Wenchao Huang, Xudong Gong et al.
With the rapid development of MANETs, secure and practical authentication is becoming increasingly important. Existing works approach the problem from two aspects: (a) secure key division and distributed storage, and (b) secure distributed authentication. However, several problems remain unsolved. Specifically, existing schemes may suffer from cheating problems and fault authentication attacks, which can result in authentication failure and DoS attacks against the authentication service. In addition, most existing schemes are not sufficiently efficient because of the exponential arithmetic required by Shamir's scheme. In this paper, we explore the properties of verifiable secret sharing (VSS) schemes based on the Chinese Remainder Theorem (CRT), and then propose a distributed secret-key storage scheme for MANETs based on CRT-VSS and trusted computing. Specifically, we use trusted computing technology to solve two existing cheating problems in secret sharing. We then analyze the homomorphism property of CRT-VSS and design a corresponding, more concise shares-product sharing scheme. On this basis, we propose a secure distributed Elliptic Curve Digital Signature Standard (ECC-DSS) authentication scheme based on CRT-VSS and trusted computing. Furthermore, we discuss the refreshing property of CRT-VSS, which is an important property for an authentication scheme, and compare it thoroughly with Shamir's scheme. Finally, we provide formal guarantees for the proposed schemes.
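For readers unfamiliar with CRT-based secret sharing, the sketch below shows the basic idea in the style of the Asmuth-Bloom threshold scheme. It is not the paper's CRT-VSS construction: it omits the verification step, trusted computing, and share refreshing, and the moduli, the blinding value `a`, and all function names are assumptions chosen for a small worked example. It needs Python 3.8 or later for `math.prod` and the modular inverse form of `pow`.

```python
from math import prod

def crt(residues, moduli):
    """Solve x = r_i (mod m_i) for pairwise coprime moduli; returns x mod prod(moduli)."""
    big_m = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        partial = big_m // m
        total += r * partial * pow(partial, -1, m)  # modular inverse of partial mod m
    return total % big_m

def make_shares(secret, m0, moduli, t, a):
    """Split `secret` (< m0) into one share per modulus; any t shares recover it.

    `a` is the blinding integer; a real scheme draws it at random.
    """
    y = secret + a * m0
    if y >= prod(sorted(moduli)[:t]):
        raise ValueError("blinded secret too large for the threshold")
    return [(m, y % m) for m in moduli]

def recover(shares, m0):
    """Rebuild the blinded value from at least t shares, then reduce mod m0."""
    moduli = [m for m, _ in shares]
    residues = [r for _, r in shares]
    return crt(residues, moduli) % m0

# Toy (3, 4) threshold example with tiny, insecure parameters.
m0 = 7
moduli = [11, 13, 17, 19]
shares = make_shares(secret=5, m0=m0, moduli=moduli, t=3, a=100)
recovered = recover(shares[:3], m0)
```

Splitting and recovery use only modular reductions and one CRT reconstruction, with no polynomial evaluation or interpolation as in Shamir's scheme.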