ClimateX: Do LLMs Accurately Assess Human Expert Confidence in Climate Statements?
This work addresses the challenge of accurately evaluating LLMs in climate science and policy, a domain where reliable communication and information retrieval depend on such evaluation. The contribution is incremental, since it applies existing methods to a new dataset.
The authors evaluate how well LLMs assess human expert confidence in climate statements. They introduce the ClimateX dataset of 8,094 expert-labeled statements and find that LLMs reach up to 47% classification accuracy but are significantly over-confident on low and medium confidence statements.
Evaluating the accuracy of outputs generated by Large Language Models (LLMs) is especially important in the climate science and policy domain. We introduce the Expert Confidence in Climate Statements (ClimateX) dataset, a novel, curated, expert-labeled dataset consisting of 8,094 climate statements collected from the latest Intergovernmental Panel on Climate Change (IPCC) reports, labeled with their associated confidence levels. Using this dataset, we show that recent LLMs can classify human expert confidence in climate-related statements, especially in a few-shot learning setting, but with limited (up to 47%) accuracy. Overall, models exhibit consistent and significant over-confidence on low and medium confidence statements. We highlight implications of our results for climate communication, LLM evaluation strategies, and the use of LLMs in information retrieval systems.
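To make the two reported quantities concrete (classification accuracy and over-confidence), here is a minimal Python sketch that scores predicted confidence labels against expert labels. It is not the authors' evaluation code. The four-level ordinal scale, the function name `evaluate`, and the signed-bias measure are assumptions chosen for illustration, and the toy labels are hypothetical rather than taken from the dataset.

```python
from collections import defaultdict

# Assumed ordinal scale. The text above names only "low" and "medium";
# "high" and "very high" are added here as an illustrative assumption.
LEVELS = {"low": 0, "medium": 1, "high": 2, "very high": 3}


def evaluate(true_labels, predicted_labels):
    """Return (accuracy, bias), where bias maps each true confidence level
    to the mean signed difference between predicted and true level."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("label lists must not be empty")

    correct = 0
    diffs = defaultdict(list)
    for true, pred in zip(true_labels, predicted_labels):
        if true == pred:
            correct += 1
        # Positive difference: the model rated the statement as more
        # confident than the experts did.
        diffs[true].append(LEVELS[pred] - LEVELS[true])

    accuracy = correct / len(true_labels)
    bias = {level: sum(d) / len(d) for level, d in diffs.items()}
    return accuracy, bias


# Hypothetical toy labels, for illustration only.
expert = ["low", "low", "medium", "high"]
model = ["medium", "high", "medium", "high"]
accuracy, bias = evaluate(expert, model)
print(accuracy, bias)
```

Under this measure, a positive bias for a given expert level means the predictions sit above the expert rating on average, which is one simple way to express the over-confidence on low and medium confidence statements described above.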