CL Oct 18, 2024

SylloBio-NLI: Evaluating Large Language Models on Biomedical Syllogistic Reasoning

Magdalena Wysocka, Danilo Carvalho, Oskar Wysocki, Marco Valentino, Andre Freitas

arXiv:2410.14399v2 11.214 citations h-index: 14

Originality: Incremental advance

AI Analysis

This paper addresses the problem of unreliable automated evidence interpretation in biomedicine, but the advance is incremental because it builds on existing NLI and prompting methods.

The paper tackles biomedical syllogistic reasoning in large language models (LLMs) by introducing SylloBio-NLI, a framework that uses external ontologies to instantiate syllogistic arguments for evaluation. Zero-shot LLMs reach average accuracies ranging from 70% on generalized modus ponens down to 23% on disjunctive syllogism. Few-shot prompting improves performance by up to 43% (LLama-3), but both settings remain sensitive to superficial lexical variations.

Syllogistic reasoning is crucial for Natural Language Inference (NLI). This capability is particularly significant in specialized domains such as biomedicine, where it can support automatic evidence interpretation and scientific discovery. This paper presents SylloBio-NLI, a novel framework that leverages external ontologies to systematically instantiate diverse syllogistic arguments for biomedical NLI. We employ SylloBio-NLI to evaluate Large Language Models (LLMs) on identifying valid conclusions and extracting supporting evidence across 28 syllogistic schemes instantiated with human genome pathways. Extensive experiments reveal that biomedical syllogistic reasoning is particularly challenging for zero-shot LLMs, which achieve an average accuracy between 70% on generalized modus ponens and 23% on disjunctive syllogism. At the same time, we found that few-shot prompting can boost the performance of different LLMs, including Gemma (+14%) and LLama-3 (+43%). However, a deeper analysis shows that both techniques exhibit high sensitivity to superficial lexical variations, highlighting a dependency between reliability, models' architecture, and pre-training regime. Overall, our results indicate that, while in-context examples have the potential to elicit syllogistic reasoning in LLMs, existing models are still far from achieving the robustness and consistency required for safe biomedical NLI applications.
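To make the idea of instantiating syllogistic schemes from an ontology more concrete, here is a minimal sketch in Python. It is not the authors' implementation: the template wording, the `NLIExample` structure, and the toy pathway and gene names (`Pathway A`, `GENE1`, and so on) are all hypothetical placeholders standing in for real ontology entries. It only illustrates how two of the schemes named in the abstract, generalized modus ponens and disjunctive syllogism, could be filled in from pathway membership data.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class NLIExample:
    scheme: str
    premises: tuple[str, ...]
    hypothesis: str
    label: bool  # True if the hypothesis follows from the premises


# Hypothetical toy ontology (assumption): child pathway -> parent pathway,
# and pathway -> member genes. Real data would come from an external ontology.
TOY_PARENT = {"Pathway A1": "Pathway A"}
TOY_GENES = {"Pathway A1": ["GENE1", "GENE2"]}


def generalized_modus_ponens(child: str, parent: str, gene: str) -> NLIExample:
    """All members of child are members of parent; gene is in child; so gene is in parent."""
    premises = (
        f"Every gene that is a member of {child} is a member of {parent}.",
        f"{gene} is a member of {child}.",
    )
    return NLIExample(
        "generalized modus ponens", premises, f"{gene} is a member of {parent}.", True
    )


def disjunctive_syllogism(gene: str, first: str, second: str) -> NLIExample:
    """gene is in first or second; gene is not in first; so gene is in second."""
    premises = (
        f"{gene} is a member of {first} or {gene} is a member of {second}.",
        f"{gene} is not a member of {first}.",
    )
    return NLIExample(
        "disjunctive syllogism", premises, f"{gene} is a member of {second}.", True
    )


def instantiate_all() -> list[NLIExample]:
    """Fill the modus ponens template for every (child, parent, gene) triple."""
    examples = []
    for child, parent in TOY_PARENT.items():
        for gene in TOY_GENES.get(child, []):
            examples.append(generalized_modus_ponens(child, parent, gene))
    return examples


if __name__ == "__main__":
    for example in instantiate_all():
        print(example.scheme, "|", " ".join(example.premises), "=>", example.hypothesis)
```

Keeping the scheme templates separate from the entity data means the same logical form can be rendered with different surface wordings or entity names, which is one way the sensitivity to lexical variation mentioned in the abstract could be probed.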
