CL AI Feb 13, 2023

STREET: A Multi-Task Structured Reasoning and Explanation Benchmark

Danilo Ribeiro, Shen Wang, Xiaofei Ma, Henry Zhu, Rui Dong, Deguang Kong, Juliette Burger, Anjelica Ramos, William Wang, Zhiheng Huang, George Karypis, Bing Xiang

Amazon

arXiv:2302.06729v1

6.827 citations

h-index: 99

Originality: Incremental advance

AI Analysis

STREET provides a new benchmark for training and testing multi-step reasoning and explanation systems in natural language, addressing a gap in existing QA datasets.

The authors tackled the problem of evaluating natural language reasoning and explanation by introducing STREET, a benchmark requiring models to provide step-by-step structured explanations alongside answers, and found that models like GPT-3 and T5 lag behind human performance.

We introduce STREET, a unified multi-task and multi-domain natural language reasoning and explanation benchmark. Unlike most existing question-answering (QA) datasets, we expect models to not only answer questions, but also produce step-by-step structured explanations describing how premises in the question are used to produce intermediate conclusions that can prove the correctness of a certain answer. We perform extensive evaluation with popular language models such as few-shot prompting GPT-3 and fine-tuned T5. We find that these models still lag behind human performance when producing such structured reasoning steps. We believe this work will provide a way for the community to better train and test systems on multi-step reasoning and explanations in natural language.
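To make the idea of a structured explanation concrete, here is a minimal sketch of how premises, intermediate conclusions, and a final answer might be linked as a sequence of reasoning steps. The class names, unit IDs, and the well-formedness check are illustrative assumptions, not the benchmark's actual data format or evaluation metric.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One reasoning step: premise IDs that jointly support a conclusion ID."""
    premises: list[str]
    conclusion: str


@dataclass
class Explanation:
    """Hypothetical container: text units plus the steps that link them."""
    units: dict[str, str]  # unit ID -> sentence (IDs are made up for this sketch)
    steps: list[Step] = field(default_factory=list)

    def is_well_formed(self, given: set[str], answer: str) -> bool:
        """Return True if each step uses only given or already-derived units
        and the answer unit is derived rather than simply given."""
        known = set(given)
        for step in self.steps:
            if not set(step.premises) <= known:
                return False
            known.add(step.conclusion)
        return answer in known and answer not in given


# Toy example (invented for illustration, not taken from the benchmark).
example = Explanation(
    units={
        "p1": "Tom has 3 apples.",
        "p2": "Tom buys 2 more apples.",
        "c1": "Tom has 3 + 2 = 5 apples.",
        "a": "The answer is 5.",
    },
    steps=[Step(["p1", "p2"], "c1"), Step(["c1"], "a")],
)
print(example.is_well_formed(given={"p1", "p2"}, answer="a"))  # True
```

The check only verifies that the steps form a valid derivation order; it says nothing about whether each step is logically correct, which is the harder part such a benchmark is meant to probe.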
