CL | LG | Dec 15, 2022

ROSCOE: A Suite of Metrics for Scoring Step-by-Step Reasoning

Berkeley, Meta AI, Microsoft, U of Toronto, UW
arXiv:2212.07919v2 | 237 citations | h-index: 116
Originality: Incremental advance
AI Analysis

This work addresses the need for reliable evaluation methods to assess reasoning steps in AI models, which is crucial for improving interpretability and verification in tasks requiring reasoning.

The authors tackle the problem of automatically evaluating the correctness of step-by-step reasoning generated by large language models. They present ROSCOE, a suite of unsupervised metrics that outperforms baseline metrics on diverse reasoning datasets.

Large language models show improved downstream task performance when prompted to generate step-by-step reasoning to justify their final answers. These reasoning steps greatly improve model interpretability and verification, but objectively studying their correctness (independent of the final answer) is difficult without reliable methods for automatic evaluation. We simply do not know how often the stated reasoning steps actually support the final end task predictions. In this work, we present ROSCOE, a suite of interpretable, unsupervised automatic scores that improve and extend previous text generation evaluation metrics. To evaluate ROSCOE against baseline metrics, we design a typology of reasoning errors and collect synthetic and human evaluation scores on commonly used reasoning datasets. In contrast with existing metrics, ROSCOE can measure semantic consistency, logicality, informativeness, fluency, and factuality, among other traits, by leveraging properties of step-by-step rationales. We empirically verify the strength of our metrics on five human-annotated and six programmatically perturbed diagnostics datasets, covering a diverse set of tasks that require reasoning skills, and show that ROSCOE can consistently outperform baseline metrics.

Code Implementations: 1 repo
