CL CY Jul 1, 2024

MalAlgoQA: Pedagogical Evaluation of Counterfactual Reasoning in Large Language Models and Implications for AI in Education

Naiming Liu, Shashank Sonkar, Myco Le, Richard Baraniuk

arXiv:2407.00938v215.226 citations h-index: 9 Has Code

Originality: Incremental advance

AI Analysis

This paper addresses the need for better counterfactual reasoning in LLMs, particularly so that AI-powered tutoring systems can identify student misconceptions. The contribution appears incremental, as it builds on existing evaluation paradigms.

The paper evaluates counterfactual reasoning in Large Language Models by introducing the MalAlgoQA dataset, which is built around "malgorithms" (flawed rationales behind incorrect answer choices). It finds that state-of-the-art LLMs score significantly lower on Malgorithm Identification Accuracy than on Algorithm Identification Accuracy, and that chain-of-thought prompting sometimes worsens performance.

This paper introduces MalAlgoQA, a novel dataset designed to evaluate the counterfactual reasoning capabilities of Large Language Models (LLMs) through a pedagogical approach. The dataset comprises mathematics and reading comprehension questions, each accompanied by four answer choices and their corresponding rationales. At the heart of MalAlgoQA are "malgorithms": rationales behind incorrect answer choices that represent flawed yet logically coherent reasoning paths. These malgorithms serve as counterfactual scenarios, allowing us to assess an LLM's ability to identify and analyze flawed reasoning patterns. We propose the Malgorithm Identification task, in which LLMs are assessed on their ability to identify the corresponding malgorithm given an incorrect answer choice. To evaluate model performance, we introduce two metrics: Algorithm Identification Accuracy (AIA) for correct answer rationale identification, and Malgorithm Identification Accuracy (MIA) for incorrect answer rationale identification. Our experiments reveal that state-of-the-art LLMs exhibit significant performance drops in MIA compared to AIA, highlighting the challenges in counterfactual reasoning. Surprisingly, we find that the chain-of-thought prompting technique not only fails to consistently enhance MIA but can sometimes lead to underperformance compared to simple prompting. These findings have important implications for developing LLMs with improved counterfactual reasoning, particularly relevant for AI-powered tutoring systems, where identifying and addressing student misconceptions is essential. The MalAlgoQA dataset is available at https://github.com/luffycodes/MalAlgoQA-Dataset.
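
As a rough illustration of the two metrics described in the abstract, the sketch below computes AIA and MIA as plain accuracies over rationale-identification attempts, split by whether the given answer choice was the correct one. The `Prediction` record, its field names, and the toy values are assumptions made for this example; they are not taken from the paper or from the dataset's actual schema.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    """One rationale-identification attempt (hypothetical record layout)."""
    answer_is_correct: bool   # True if the given answer choice is the correct one
    gold_rationale: int       # index of the rationale that truly matches the choice
    predicted_rationale: int  # index of the rationale the model selected


def identification_accuracy(preds, on_correct_answers):
    """Share of exact rationale matches among items of the requested kind."""
    subset = [p for p in preds if p.answer_is_correct == on_correct_answers]
    if not subset:
        return float("nan")  # no items of this kind to score
    hits = sum(p.predicted_rationale == p.gold_rationale for p in subset)
    return hits / len(subset)


# Toy values invented for illustration only; these are not results from the paper.
preds = [
    Prediction(True, 0, 0),
    Prediction(True, 2, 2),
    Prediction(False, 1, 1),
    Prediction(False, 3, 0),
    Prediction(False, 2, 1),
    Prediction(False, 0, 0),
]

aia = identification_accuracy(preds, on_correct_answers=True)   # correct-answer rationales
mia = identification_accuracy(preds, on_correct_answers=False)  # malgorithms
print(f"AIA={aia:.2f} MIA={mia:.2f}")
```

Under this reading, the gap between the two numbers is the performance drop the abstract refers to; how the paper aggregates over questions or answer choices may differ from this simple pooled accuracy.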
