AI · CL · Dec 17, 2024

Are Your LLMs Capable of Stable Reasoning?

arXiv:2412.13147v565 citations · h-index: 21 · ACL
Originality: Incremental advance
AI Analysis

This paper addresses the problem of unreliable LLM evaluation, which affects both researchers and practitioners. The contribution is incremental: it improves evaluation metrics rather than model capabilities.

The paper tackles the gap between LLM benchmark performance and real-world applications by introducing G-Pass@k, a metric that assesses both accuracy and consistency across multiple sampling attempts, revealing significant opportunities to enhance realistic reasoning abilities.

The rapid advancement of large language models (LLMs) has shown remarkable progress in complex reasoning tasks. However, a significant disparity exists between benchmark performances and real-world applications. We attribute this gap primarily to current evaluation protocols and metrics, which inadequately capture the full spectrum of LLM capabilities, especially in complex reasoning tasks where both accuracy and consistency are essential. In this paper, we introduce G-Pass@$k$, a novel evaluation metric that continuously assesses model performance across multiple sampling attempts, quantifying both the model's performance potential and its stability. Through extensive experiments on various public and newly constructed benchmarks, we employ G-Pass@$k$ in conjunction with state-of-the-art large language models to provide comprehensive insights into their potential capabilities and operational consistency. Our findings reveal a significant opportunity to enhance the realistic reasoning abilities of LLMs, underscoring the necessity for more robust evaluation metrics.
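The abstract describes G-Pass@$k$ only in words, so the following is a minimal sketch of one way to compute such a metric, not the authors' reference implementation. It assumes a hypergeometric formulation: for a question with `n` generations of which `c` are correct, the score is the probability that at least `ceil(tau * k)` of `k` generations drawn without replacement are correct. The threshold `tau` and the function names are assumptions introduced here for illustration; they do not appear in the text above.

```python
from math import ceil, comb


def g_pass_at_k(n: int, c: int, k: int, tau: float) -> float:
    """Probability that at least ceil(tau * k) of k samples are correct.

    Assumed setup (not stated in the abstract): n generations were sampled
    for one question, c of them are correct, and k are drawn without
    replacement. tau in (0, 1] is a hypothetical stability threshold.
    """
    if not (0 <= c <= n and 1 <= k <= n and 0.0 < tau <= 1.0):
        raise ValueError("require 0 <= c <= n, 1 <= k <= n, 0 < tau <= 1")
    # Small epsilon guards against float error in tau * k (e.g. 3.0000000004).
    min_correct = max(1, ceil(tau * k - 1e-9))
    # Hypergeometric tail: sum over j = number of correct samples in the draw.
    # math.comb returns 0 when the lower index exceeds the upper one, so
    # impossible draws contribute nothing.
    favourable = sum(
        comb(c, j) * comb(n - c, k - j) for j in range(min_correct, k + 1)
    )
    return favourable / comb(n, k)


def mean_g_pass_at_k(results: list[tuple[int, int]], k: int, tau: float) -> float:
    """Average the per-question score over (n, c) pairs for a benchmark."""
    return sum(g_pass_at_k(n, c, k, tau) for n, c in results) / len(results)
```

Under this assumed form, a small `tau` asks only for at least one correct sample and so behaves like the familiar Pass@k, which captures performance potential. Setting `tau = 1.0` requires every drawn sample to be correct, which captures stability.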

Code Implementations: 1 repo
