CL · AI · Jun 6, 2024

Every Answer Matters: Evaluating Commonsense with Probabilistic Measures

arXiv:2406.04145v1 · 29 citations
Originality: Incremental advance
AI Analysis

This paper addresses the need for better evaluation of machine commonsense, which is crucial for AI systems that must handle real-world ambiguity. Its contribution is incremental: a new task and an accompanying evaluation method.

The authors evaluate commonsense in language models by introducing commonsense frame completion (CFC), a new generative task that captures the probabilistic nature of commonsense. They also propose a probabilistic evaluation method that correlates strongly with human judgments, and they show that humans drastically outperform language models on this dataset.

Large language models have demonstrated impressive performance on commonsense tasks; however, these tasks are often posed as multiple-choice questions, allowing models to exploit systematic biases. Commonsense is also inherently probabilistic with multiple correct answers. The purpose of "boiling water" could be making tea and cooking, but it also could be killing germs. Existing tasks do not capture the probabilistic nature of common sense. To this end, we present commonsense frame completion (CFC), a new generative task that evaluates common sense via multiple open-ended generations. We also propose a method of probabilistic evaluation that strongly correlates with human judgments. Humans drastically outperform strong language model baselines on our dataset, indicating this approach is both a challenging and useful evaluation of machine common sense.

Code Implementations: 1 repo