CL · Feb 17, 2025

HellaSwag-Pro: A Large-Scale Bilingual Benchmark for Evaluating the Robustness of LLMs in Commonsense Reasoning

arXiv:2502.11393v2 · 7 citations · h-index: 28 · ACL
Originality: Incremental advance
AI Analysis

This work provides a high-quality benchmark for evaluating LLM robustness in commonsense reasoning. It is an incremental advance, since it builds on existing datasets such as HellaSwag.

The authors evaluate the robustness of large language models (LLMs) in commonsense reasoning by introducing HellaSwag-Pro, a bilingual benchmark with 11,200 cases. They find that the 41 tested LLMs are far from robust and that robustness varies with the language in which a model is tested.

Large language models (LLMs) have shown remarkable capabilities in commonsense reasoning; however, some variations in questions can trigger incorrect responses. Do these models truly understand commonsense knowledge, or just memorize expression patterns? To investigate this question, we present the first extensive robustness evaluation of LLMs in commonsense reasoning. We introduce HellaSwag-Pro, a large-scale bilingual benchmark consisting of 11,200 cases, by designing and compiling seven types of question variants. To construct this benchmark, we propose a two-stage method to develop Chinese HellaSwag, a finely annotated dataset comprising 12,000 instances across 56 categories. We conduct extensive experiments on 41 representative LLMs, revealing that these LLMs are far from robust in commonsense reasoning. Furthermore, this robustness varies depending on the language in which the LLM is tested. This work establishes a high-quality evaluation benchmark, with extensive experiments offering valuable insights to the community in commonsense reasoning for LLMs.

