CL, Jun 7, 2023

STEPS: A Benchmark for Order Reasoning in Sequential Tasks

arXiv:2306.04441v1, 7 citations, h-index: 12
AI Analysis

This work addresses the challenge of ensuring that robots and AI agents carry out actions in the correct order in tasks such as cooking and manufacturing. Its contribution is incremental: it introduces a new benchmark rather than a novel method.

The authors address the evaluation of order reasoning in sequential tasks by proposing the STEPS benchmark, which tests whether models can judge the rationality of a given next step in a recipe and select the reasonable step in a multiple-choice question. They find that current LLMs struggle under zero-shot prompting and few-shot in-context learning, and that prompting methods lag behind tuning-based approaches.

Various human activities can be abstracted into a sequence of actions in natural text, e.g. cooking, repairing, manufacturing, etc. Such action sequences depend heavily on the order of execution, and disorder in an action sequence leads to failure of further task execution by robots or AI agents. Therefore, to verify the order reasoning capability of current neural models in sequential tasks, we propose a challenging benchmark named STEPS. STEPS involves two subtask settings, focusing on determining the rationality of a given next step in recipes and selecting the reasonable step from a multi-choice question, respectively. We describe the data construction and task formulations, and benchmark most of the significant Large Language Models (LLMs). The experimental results demonstrate that 1) the commonsense reasoning of action orders in sequential tasks is challenging to resolve via zero-shot prompting or few-shot in-context learning for LLMs; 2) prompting methods still lag significantly behind tuning-based methods on STEPS.
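To make the two subtask settings concrete, the following is a minimal sketch of how such examples might be represented and scored. The class names, field names, toy recipe, and the use of plain accuracy are all assumptions made for illustration. They are not taken from the STEPS data release or the paper's evaluation code.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class NextStepExample:
    """Subtask 1 (hypothetical layout): judge whether a candidate next step is reasonable."""
    context: List[str]  # recipe steps performed so far
    candidate: str      # proposed next step
    label: bool         # True if the candidate is a reasonable next step


@dataclass
class MultiChoiceExample:
    """Subtask 2 (hypothetical layout): select the reasonable next step among choices."""
    context: List[str]  # recipe steps performed so far
    choices: List[str]  # candidate next steps
    answer: int         # index into `choices` of the reasonable step


def accuracy(predictions: Sequence, golds: Sequence) -> float:
    """Fraction of predictions equal to the gold labels (assumed metric)."""
    if len(predictions) != len(golds):
        raise ValueError("predictions and golds must have the same length")
    if not golds:
        return 0.0
    return sum(p == g for p, g in zip(predictions, golds)) / len(golds)


# Invented toy recipe, for illustration only.
context = ["Crack two eggs into a bowl.", "Whisk the eggs."]

binary_example = NextStepExample(
    context=context,
    candidate="Pour the eggs into a heated pan.",
    label=True,
)

choice_example = MultiChoiceExample(
    context=context,
    choices=[
        "Serve the omelette on a plate.",
        "Pour the eggs into a heated pan.",
        "Crack two eggs into a bowl.",
    ],
    answer=1,
)
```

Under this layout, the first subtask reduces to binary classification over `label`, and the second to predicting the index stored in `answer`. Either can be scored with `accuracy` over a list of model predictions.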

