Assessing and Enhancing the Robustness of Large Language Models with Task Structure Variations for Logical Reasoning
This work addresses a robustness gap in how LLMs handle logical reasoning tasks, which matters for reliable AI applications in domains that require precise reasoning. The contribution is incremental, since it builds on existing datasets and methods.
The researchers assess and improve the robustness of large language models (LLMs) in logical reasoning. They create three new datasets with task structure variations and find that simple augmentations, such as shuffling the options and substituting the correct answer, greatly hinder model performance. They also show that introducing task variations during training and using logic-driven data augmentation improve performance on both the original and the new datasets.
Large language models (LLMs), such as LLaMA, Alpaca, Vicuna, GPT-3.5 and GPT-4, have advanced the performance of AI systems on various natural language processing tasks to human-like levels. However, their generalisation and robustness when performing logical reasoning have not been sufficiently assessed. To comprehensively evaluate this ability, we develop three new logical reasoning datasets named "ReClor-plus", "LogiQA-plus" and "LogiQAv2-plus" that extend standard logical reasoning datasets to evaluate the robustness of LLM reasoning. For each, we create three subsets: the first with randomly shuffled options, the second with the correct choice replaced by "none of the other options is correct", and the third with a combination of shuffling and substitution. Experiments on these datasets show that these simple augmentations greatly hinder the models' performance. Although the models perform well on the original publicly available datasets, all of them perform poorly on the newly constructed datasets. We also demonstrate that introducing task variations into the training set can markedly improve model performance on both the original and our developed datasets. Finally, we show that applying logic-driven data augmentation for fine-tuning and prompting can enhance generalisation in both discriminative and generative models, offering a path to improving their robustness for tasks involving logical reasoning. Source code and data are made publicly available at https://github.com/Strong-AI-Lab/Logical-and-abstract-reasoning.
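
The three subset constructions described above can be illustrated with a minimal Python sketch. This is not the authors' released code: the `Item` record, the function names, and the choice to shuffle before substituting in the combined variant are all assumptions made for illustration. Only the replacement string comes from the abstract.

```python
import random
from dataclasses import dataclass
from typing import List

# Replacement text taken from the abstract.
NONE_OPTION = "none of the other options is correct"


@dataclass(frozen=True)
class Item:
    """Hypothetical record for one multiple-choice question."""
    context: str
    question: str
    options: List[str]
    label: int  # index of the correct option


def shuffle_options(item: Item, rng: random.Random) -> Item:
    """Variant 1: randomly permute the options and track the new label."""
    order = list(range(len(item.options)))
    rng.shuffle(order)
    new_options = [item.options[i] for i in order]
    new_label = order.index(item.label)  # where the correct option moved to
    return Item(item.context, item.question, new_options, new_label)


def replace_answer(item: Item) -> Item:
    """Variant 2: overwrite the correct option with the 'none' option."""
    new_options = list(item.options)
    new_options[item.label] = NONE_OPTION
    return Item(item.context, item.question, new_options, item.label)


def shuffle_and_replace(item: Item, rng: random.Random) -> Item:
    """Variant 3: combine both (the ordering here is an assumption)."""
    return replace_answer(shuffle_options(item, rng))


if __name__ == "__main__":
    example = Item(
        context="All birds have wings. Tweety is a bird.",
        question="Which statement must be true?",
        options=["Tweety has wings.", "Tweety can fly.",
                 "Tweety is a penguin.", "Tweety has no wings."],
        label=0,
    )
    rng = random.Random(0)  # fixed seed so the variants are reproducible
    for variant in (shuffle_options(example, rng),
                    replace_answer(example),
                    shuffle_and_replace(example, rng)):
        print(variant.label, variant.options)
```

In the substituted variants the correct label points at the "none" option, so a model that relies on recognising a memorised answer string or a fixed answer position can no longer succeed that way.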