CL · Dec 28, 2023

MR-GSM8K: A Meta-Reasoning Benchmark for Large Language Model Evaluation

Zhongshen Zeng, Pengguang Chen, Shu Liu, Haiyun Jiang, Jiaya Jia

arXiv:2312.17080v4 · 13.461 citations · h-index: 14 · Has Code · ICLR

Originality: Incremental advance

AI Analysis

This work addresses the need for a more comprehensive evaluation of LLMs' cognitive capabilities, which matters mainly to researchers and developers. It is an incremental advance because it builds on the existing GSM8K dataset.

The authors shift LLM evaluation from result-oriented assessment to meta-reasoning, in which a model scores solutions as a teacher would rather than answering questions as a student. The resulting MR-GSM8K benchmark exposes performance gaps of over 20 absolute points between GPT-4 and models such as Deepseek-v2 that closely compete with it on GSM8K.

In this work, we introduce a novel evaluation paradigm for Large Language Models (LLMs) that compels them to transition from a traditional question-answering role, akin to a student, to a solution-scoring role, akin to a teacher. This paradigm, focusing on "reasoning about reasoning," hence termed meta-reasoning, shifts the emphasis from result-oriented assessments, which often neglect the reasoning process, to a more comprehensive evaluation that effectively distinguishes between the cognitive capabilities of different models. By applying this paradigm in the GSM8K dataset, we have developed the MR-GSM8K benchmark. Our extensive analysis includes several state-of-the-art models from both open-source and commercial domains, uncovering fundamental deficiencies in their training and evaluation methodologies. Notably, while models like Deepseek-v2 and Claude3-Sonnet closely competed with GPT-4 in GSM8K, their performance disparities expanded dramatically in MR-GSM8K, with differences widening to over 20 absolute points, underscoring the significant challenge posed by our meta-reasoning approach.
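The shift from answering to grading can be illustrated with a minimal sketch. It is not the paper's scoring procedure or metric: `GradingItem`, `grading_accuracy`, and the toy `always_accept` grader are hypothetical names introduced here, and the four items are made-up arithmetic examples rather than benchmark data. The sketch only assumes that each item pairs a candidate solution with a reference correctness label and that the model under test returns a verdict on that solution.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class GradingItem:
    """One hypothetical meta-reasoning item: a question plus a candidate solution."""
    question: str
    solution: str      # candidate solution the model must grade
    is_correct: bool   # reference label for the candidate solution


# A grader maps (question, solution) to a verdict: True means "solution is correct".
Grader = Callable[[str, str], bool]


def grading_accuracy(items: List[GradingItem], grader: Grader) -> float:
    """Fraction of items where the grader's verdict matches the reference label."""
    if not items:
        raise ValueError("items must be non-empty")
    hits = sum(grader(it.question, it.solution) == it.is_correct for it in items)
    return hits / len(items)


def always_accept(question: str, solution: str) -> bool:
    """Toy stand-in for an LLM call (assumption): accepts every solution."""
    return True


# Made-up items for illustration only; half of the candidate solutions are wrong.
items = [
    GradingItem("2 + 3 = ?", "2 + 3 = 5. Answer: 5", True),
    GradingItem("4 * 6 = ?", "4 * 6 = 26. Answer: 26", False),
    GradingItem("10 - 7 = ?", "10 - 7 = 3. Answer: 3", True),
    GradingItem("9 / 3 = ?", "9 / 3 = 4. Answer: 4", False),
]

print(grading_accuracy(items, always_accept))  # 0.5
```

A grader that accepts everything matches the labels only on the correct solutions, so a balanced set of correct and incorrect candidates separates models that actually check the reasoning from models that do not.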
