Transformers in the Service of Description Logic-based Contexts
This work addresses the need for more challenging benchmarks to assess reasoning in AI, though it is incremental as it builds on existing transformer methods.
The authors tackled the problem of evaluating transformer models on complex reasoning tasks by constructing DELTA$_D$, a 384K-example natural language dataset based on the description logic $\mathcal{ALCQ}$. They found that a supervised fine-tuned DeBERTa-based model mastered the task, while GPT-3.5 and GPT-4 improved significantly under few-shot prompting with only a small number of samples (9 shots).
Recent advances in transformer-based models have sparked research interest in their ability to learn to perform reasoning tasks. However, most of the contexts used for this purpose are in practice very simple: they are generated from short (fragments of) first-order logic sentences with only a few logical operators and quantifiers. In this work, we construct a natural language dataset, DELTA$_D$, using the description logic language $\mathcal{ALCQ}$. DELTA$_D$ contains 384K examples and increases in complexity along two dimensions: i) reasoning depth, and ii) linguistic complexity. Using this dataset, we systematically investigate the reasoning ability of a supervised fine-tuned DeBERTa-based model and of two large language models (GPT-3.5, GPT-4) with few-shot prompting. Our results demonstrate that the DeBERTa-based model can master the reasoning task and that the performance of the GPT models can improve significantly even when only a small number of samples is provided (9 shots). We open-source our code and datasets.
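To make the few-shot setting concrete, the following is a minimal sketch of how a 9-shot prompt could be assembled from solved examples. It is not the authors' released code: the `Example` fields, the `Context`/`Question`/`Answer` template, and the label strings are all assumptions introduced for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Example:
    """One context/question pair (hypothetical schema, not the dataset's own)."""
    context: str   # natural-language rendering of a knowledge base
    question: str  # statement whose truth value is queried
    label: str     # assumed label strings, e.g. "True", "False", "Unknown"


def build_few_shot_prompt(shots: List[Example], query: Example, k: int = 9) -> str:
    """Concatenate k solved examples, followed by the unanswered query."""
    if len(shots) < k:
        raise ValueError(f"need at least {k} shots, got {len(shots)}")
    # Each solved example shows the model the expected answer format.
    blocks = [
        f"Context: {ex.context}\nQuestion: {ex.question}\nAnswer: {ex.label}"
        for ex in shots[:k]
    ]
    # The query ends with an open "Answer:" for the model to complete.
    blocks.append(
        f"Context: {query.context}\nQuestion: {query.question}\nAnswer:"
    )
    return "\n\n".join(blocks)
```

The resulting string would be sent to the language model as a single prompt; how shots are selected (for example, balanced across labels or reasoning depths) is a separate design choice that this sketch leaves open.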