GLoRE: Evaluating Logical Reasoning of Large Language Models
GLoRE addresses the need of researchers and developers for a standardized evaluation of logical reasoning in LLMs, though the contribution is incremental because it builds on existing datasets and methods.
The authors tackle the problem of assessing logical reasoning in large language models by introducing GLoRE, a platform that consolidates existing datasets and standardizes them for evaluation. They find that QwQ-32B achieves the highest benchmark performance to date.
Large language models (LLMs) have shown significant general language understanding abilities. However, there have been few attempts to assess the logical reasoning capacities of these LLMs, an essential facet of natural language understanding. To encourage further investigation in this area, we introduce GLoRE, a General Logical Reasoning Evaluation platform that not only consolidates diverse datasets but also standardizes them into a unified format suitable for evaluating large language models in zero-shot and few-shot scenarios. Our experimental results show that, measured against human performance and supervised fine-tuned models, the logical reasoning capabilities of large reasoning models such as OpenAI's o1 mini, DeepSeek R1, and QwQ-32B have improved remarkably, with QwQ-32B achieving the highest benchmark performance to date. GLoRE is designed as a living project that continuously integrates new datasets and models, facilitating robust and comparative assessments of model performance in both the commercial and Hugging Face communities.
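To make the idea of a unified format concrete, the following is a minimal sketch of how one multiple-choice item might be stored and turned into a zero-shot or few-shot prompt. It is an illustration only, not GLoRE's actual schema or code: the `Example` fields, the prompt layout, and the helper names `format_item`, `build_prompt`, and `accuracy` are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Sequence

LETTERS = "ABCDEFGH"  # option labels; assumes at most 8 options per item


@dataclass(frozen=True)
class Example:
    """One multiple-choice item in a hypothetical unified format."""
    context: str
    question: str
    options: Sequence[str]
    answer: int  # index of the correct entry in `options`


def format_item(ex: Example, with_answer: bool) -> str:
    """Render one item; demonstrations include the gold answer letter."""
    lines = [f"Passage: {ex.context}", f"Question: {ex.question}"]
    lines += [f"{LETTERS[i]}. {opt}" for i, opt in enumerate(ex.options)]
    lines.append(f"Answer: {LETTERS[ex.answer]}" if with_answer else "Answer:")
    return "\n".join(lines)


def build_prompt(target: Example, demos: Sequence[Example] = ()) -> str:
    """Zero-shot when `demos` is empty, few-shot otherwise."""
    blocks = [format_item(d, with_answer=True) for d in demos]
    blocks.append(format_item(target, with_answer=False))
    return "\n\n".join(blocks)


def accuracy(predictions: Sequence[str], examples: Sequence[Example]) -> float:
    """Fraction of predictions whose first letter matches the gold label."""
    if not examples:
        return 0.0
    correct = sum(
        1
        for pred, ex in zip(predictions, examples)
        if pred.strip().upper()[:1] == LETTERS[ex.answer]
    )
    return correct / len(examples)
```

Under these assumptions, the same record type serves both settings: `build_prompt(item)` yields a zero-shot prompt, and `build_prompt(item, demos)` prepends solved demonstrations for the few-shot case, so that model outputs can be scored with a single `accuracy` call regardless of the source dataset.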