Evaluation of an LLM in Identifying Logical Fallacies: A Call for Rigor When Adopting LLMs in HCI Research
The paper addresses the need for rigorous evaluation to prevent misuse of LLMs in HCI research, particularly for digital misinformation interventions, but its contribution is incremental because it applies existing methods to a new task.
The paper tackles the problem of evaluating LLMs for suitability in HCI research by assessing GPT-4's ability to identify logical fallacies, finding accuracies of 0.79 on a labeled dataset and 0.90 for a specific use case.
There is increasing interest in adopting LLMs in HCI research. However, LLMs may be regarded as a panacea because of their powerful capabilities, with little scrutiny of whether they are suitable for their intended tasks. We contend that LLMs should be adopted critically, following rigorous evaluation. Accordingly, we present an evaluation of an LLM in identifying logical fallacies, a capability that will form part of a digital misinformation intervention. Comparing its output against a labeled dataset, we found that GPT-4 achieves an accuracy of 0.79 and, for our intended use case that excludes invalid or unidentified instances, an accuracy of 0.90. This gives us the confidence to proceed with applying the LLM while keeping in mind the areas where it still falls short. The paper describes our evaluation approach, our results, and our reflections on using the LLM for the intended task.
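The two accuracy figures differ only in which instances are counted. The following is a minimal sketch of that distinction, not the authors' evaluation code: the abstract does not define how "invalid or unidentified" instances are marked, so the label names in `EXCLUDED`, the function `accuracy`, and the toy data are all assumptions made for illustration.

```python
from typing import Iterable, Sequence

# Hypothetical markers for instances left out of the use-case accuracy.
# The paper's actual labeling scheme may differ.
EXCLUDED = frozenset({"invalid", "unidentified"})


def accuracy(
    gold: Sequence[str],
    pred: Sequence[str],
    exclude: Iterable[str] = (),
) -> float:
    """Fraction of instances where the predicted fallacy matches the gold label.

    Any instance whose gold or predicted label is in `exclude` is dropped
    before the fraction is computed.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    skip = set(exclude)
    pairs = [(g, p) for g, p in zip(gold, pred) if g not in skip and p not in skip]
    if not pairs:
        raise ValueError("no instances left after exclusion")
    return sum(g == p for g, p in pairs) / len(pairs)


# Toy data for illustration only; these are not the paper's results.
gold = ["ad hominem", "strawman", "invalid", "slippery slope", "strawman"]
pred = ["ad hominem", "strawman", "strawman", "unidentified", "ad hominem"]

overall = accuracy(gold, pred)             # every instance counts
use_case = accuracy(gold, pred, EXCLUDED)  # excluded instances dropped
```

Because the excluded instances in this toy data are all counted as errors in the first figure, dropping them raises the second figure. The gap between 0.79 and 0.90 reported above would arise in the same way if the paper's exclusion works like this, which is an assumption here.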