AI · CL · LG · Jun 4, 2024

ACCORD: Closing the Commonsense Measurability Gap

arXiv:2406.02804v2 · 12 citations
AI Analysis

ACCORD addresses the commonsense measurability gap for AI researchers and developers by providing a scalable benchmark, though the contribution is incremental in that it builds on existing evaluation methods.

The authors tackle the problem of measuring commonsense reasoning in large language models by introducing ACCORD, a framework and benchmark suite that uses controlled multi-hop counterfactuals to disentangle grounding and reasoning abilities. They find that state-of-the-art models such as GPT-4o degrade to random chance with only moderate scaling of reasoning complexity, indicating significant room for improvement.

We present ACCORD, a framework and benchmark suite for disentangling the commonsense grounding and reasoning abilities of large language models (LLMs) through controlled, multi-hop counterfactuals. ACCORD introduces formal elements to commonsense reasoning to explicitly control and quantify reasoning complexity beyond the typical 1 or 2 hops. Uniquely, ACCORD can automatically generate benchmarks of arbitrary reasoning complexity, and so it scales with future LLM improvements. Benchmarking state-of-the-art LLMs -- including GPT-4o (2024-05-13), Llama-3-70B-Instruct, and Mixtral-8x22B-Instruct-v0.1 -- shows performance degrading to random chance with only moderate scaling, leaving substantial headroom for improvement. We release a leaderboard of the benchmark suite tested in this work, as well as code for automatically generating more complex benchmarks.
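The claim that performance degrades to random chance as reasoning complexity grows can be made concrete with a small sketch. This is not the ACCORD codebase: the function names, the `(num_hops, is_correct)` record format, the five-option multiple-choice setup, and the 0.05 tolerance are all assumptions chosen for illustration, and the toy records are invented rather than results from the paper.

```python
from collections import defaultdict


def accuracy_by_hops(records):
    """Group (num_hops, is_correct) pairs by hop count and return {hops: accuracy}."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for hops, is_correct in records:
        totals[hops] += 1
        correct[hops] += int(is_correct)
    return {h: correct[h] / totals[h] for h in sorted(totals)}


def chance_level(num_choices):
    """Expected accuracy of uniform random guessing on a multiple-choice item."""
    return 1.0 / num_choices


def hops_at_chance(acc, num_choices, tolerance=0.05):
    """Return hop counts whose accuracy is within `tolerance` of random guessing."""
    baseline = chance_level(num_choices)
    return [h for h, a in acc.items() if abs(a - baseline) <= tolerance]


# Invented toy records (NOT data from the paper): 10 items each at 1, 3, and 5 hops.
toy_records = (
    [(1, True)] * 9 + [(1, False)] * 1
    + [(3, True)] * 5 + [(3, False)] * 5
    + [(5, True)] * 2 + [(5, False)] * 8
)

acc = accuracy_by_hops(toy_records)
at_chance = hops_at_chance(acc, num_choices=5)  # assumes 5 answer options per item
```

Grouping accuracy by hop count, rather than reporting one aggregate score, is what lets a benchmark of this kind show where a model's accuracy becomes indistinguishable from guessing.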

Code Implementations: 1 repo
