Benchmarking Defeasible Reasoning with Large Language Models -- Initial Experiments and Future Directions
This work addresses a need among AI researchers to understand the reasoning limitations of LLMs, but its contribution is incremental: it adapts an existing benchmark for defeasible logic reasoners to LLMs.
The paper evaluates the nonmonotonic reasoning capabilities of large language models by proposing a benchmark based on defeasible rule-based reasoning patterns. It describes preliminary experiments on ChatGPT, but reports no concrete performance numbers.
Large Language Models (LLMs) have gained prominence in the AI landscape due to their exceptional performance. It is therefore essential to gain a better understanding of their capabilities and limitations, including with respect to nonmonotonic reasoning. This paper proposes a benchmark that covers various defeasible rule-based reasoning patterns. We adapted an existing benchmark for defeasible logic reasoners by translating its defeasible rules into text suitable for LLMs. We conducted preliminary experiments on nonmonotonic rule-based reasoning with ChatGPT and compared the results with the reasoning patterns defined by defeasible logic.
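To illustrate what translating defeasible rules into text for an LLM might look like, here is a minimal Python sketch. It is not the paper's actual translation scheme. The `Rule` class, the sentence templates, the `~` negation marker, and the function names `rule_to_text` and `theory_to_prompt` are all hypothetical and chosen only for illustration. The sketch assumes a theory with facts, strict rules, defeasible rules, and a superiority relation between rules, as in standard defeasible logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """A rule of a defeasible theory (hypothetical representation)."""
    label: str
    body: tuple          # antecedent literals, e.g. ("bird",)
    head: str            # consequent literal; a leading "~" marks negation
    kind: str = "defeasible"   # "strict" or "defeasible"


# Assumed wording: the templates below are illustrative, not the paper's.
TEMPLATES = {
    "strict": "Rule {label}: if {body}, then always {head}.",
    "defeasible": "Rule {label}: if {body}, then typically {head}.",
}


def lit(literal: str) -> str:
    """Render a literal as text, turning "~p" into "not p"."""
    return f"not {literal[1:]}" if literal.startswith("~") else literal


def rule_to_text(rule: Rule) -> str:
    """Render one rule as an English sentence using the templates."""
    body = " and ".join(lit(b) for b in rule.body)
    return TEMPLATES[rule.kind].format(
        label=rule.label, body=body, head=lit(rule.head)
    )


def theory_to_prompt(facts, rules, superiority, query) -> str:
    """Assemble facts, rules, priorities, and a query into one prompt."""
    lines = [f"Fact: {lit(f)}." for f in facts]
    lines += [rule_to_text(r) for r in rules]
    lines += [f"Rule {a} overrides rule {b}." for a, b in superiority]
    lines.append(f"Question: can we conclude {lit(query)}? Answer yes or no.")
    return "\n".join(lines)


# A classic conflict pattern: birds typically fly, penguins typically do not.
rules = [
    Rule("r1", ("penguin",), "bird", kind="strict"),
    Rule("r2", ("bird",), "flies"),
    Rule("r3", ("penguin",), "~flies"),
]
prompt = theory_to_prompt(
    facts=["penguin"],
    rules=rules,
    superiority=[("r3", "r2")],
    query="flies",
)
print(prompt)
```

In this example, defeasible logic does not support the conclusion "flies", because rule r3 is superior to r2. The model's answer to the generated prompt could then be compared against that expected outcome. The sketch assumes every rule has at least one antecedent.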