CL | LG | Nov 18, 2024

Testing Uncertainty of Large Language Models for Physics Knowledge and Reasoning

arXiv:2411.14465v1 | 1 citation | h-index: 2 | Has Code
Originality: Synthesis-oriented
AI Analysis

This work addresses the challenge of assessing LLM reliability in physics for researchers and users, but it is incremental, as it applies existing uncertainty analysis to a new domain.

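The study evaluated the uncertainty and accuracy of large language models on physics multiple-choice questions. It found that most models are accurate when they are certain, that the overall relationship between accuracy and uncertainty follows a broad horizontal bell-shaped distribution, and that the asymmetry between accuracy and uncertainty grows for reasoning tasks while the relationship stays sharp for knowledge retrieval.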

Large Language Models (LLMs) have gained significant popularity in recent years for their ability to answer questions in various fields. However, these models have a tendency to "hallucinate" their responses, making it challenging to evaluate their performance. A major challenge is determining how to assess the certainty of a model's predictions and how it correlates with accuracy. In this work, we introduce an analysis for evaluating the performance of popular open-source LLMs, as well as GPT-3.5 Turbo, on multiple-choice physics questionnaires. We focus on the relationship between answer accuracy and variability in topics related to physics. Our findings suggest that most models provide accurate replies in cases where they are certain, but this is far from a general behavior. The relationship between accuracy and uncertainty reveals a broad horizontal bell-shaped distribution. We report how the asymmetry between accuracy and uncertainty intensifies as the questions demand more logical reasoning of the LLM agent, while the same relationship remains sharp for knowledge retrieval tasks.
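To make the accuracy-versus-variability comparison concrete, here is a minimal sketch of one way it could be computed per question. The abstract does not say how variability is measured, so this sketch assumes each question is posed to the model several times and that uncertainty is the normalized Shannon entropy of the sampled answers. The function names `answer_entropy` and `question_stats` and the four-option default are hypothetical choices for illustration, not the authors' method.

```python
import math
from collections import Counter


def answer_entropy(answers, n_options=4):
    """Normalized Shannon entropy of repeated answers to one question.

    0.0 means the model gave the same answer every time;
    1.0 means the answers are spread uniformly over all options.
    (Assumed uncertainty measure, not taken from the paper.)
    """
    counts = Counter(answers)
    total = len(answers)
    h = -sum((c / total) * math.log(c / total) for c in counts.values())
    # max() guards against a negative zero when all answers agree.
    return max(0.0, h) / math.log(n_options)


def question_stats(answers, correct, n_options=4):
    """Return (accuracy, uncertainty) for one multiple-choice question."""
    accuracy = sum(a == correct for a in answers) / len(answers)
    return accuracy, answer_entropy(answers, n_options)


if __name__ == "__main__":
    # Hypothetical sampled answers: (model replies, correct option).
    samples = {
        "certain_and_right": (["A"] * 8, "A"),
        "split_two_ways": (["A", "A", "B", "B"], "A"),
        "fully_scattered": (["A", "B", "C", "D"], "C"),
    }
    for name, (answers, correct) in samples.items():
        acc, unc = question_stats(answers, correct)
        print(f"{name}: accuracy={acc:.2f}, uncertainty={unc:.2f}")
```

Plotting the resulting (uncertainty, accuracy) pairs across all questions would give the kind of distribution the abstract describes, under the assumptions stated above.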
