AI CL LG Jun 4, 2024

ACCORD: Closing the Commonsense Measurability Gap

François Roewer-Després, Jinyue Feng, Zining Zhu, Frank Rudzicz

arXiv:2406.02804v215.412 citations Has Code

Originality: Incremental advance

AI Analysis

ACCORD addresses the commonsense measurability gap by giving AI researchers and developers a scalable benchmark; the advance is incremental because it builds on existing evaluation methods.

The authors tackle the problem of measuring commonsense reasoning in large language models by introducing ACCORD, a framework and benchmark suite that uses controlled multi-hop counterfactuals to disentangle grounding from reasoning ability. They find that state-of-the-art models such as GPT-4o degrade to random chance as reasoning complexity scales only moderately, indicating substantial room for improvement.

We present ACCORD, a framework and benchmark suite for disentangling the commonsense grounding and reasoning abilities of large language models (LLMs) through controlled, multi-hop counterfactuals. ACCORD introduces formal elements to commonsense reasoning to explicitly control and quantify reasoning complexity beyond the typical 1 or 2 hops. Uniquely, ACCORD can automatically generate benchmarks of arbitrary reasoning complexity, and so it scales with future LLM improvements. Benchmarking state-of-the-art LLMs -- including GPT-4o (2024-05-13), Llama-3-70B-Instruct, and Mixtral-8x22B-Instruct-v0.1 -- shows performance degrading to random chance with only moderate scaling, leaving substantial headroom for improvement. We release a leaderboard of the benchmark suite tested in this work, as well as code for automatically generating more complex benchmarks.
