LG, AI, CL, CY, IR · Nov 28, 2023

ClimateX: Do LLMs Accurately Assess Human Expert Confidence in Climate Statements?

arXiv:2311.17107v1 · 7 citations · h-index: 1
Originality: Synthesis-oriented
AI Analysis

This work addresses the challenge of evaluating LLM accuracy in climate science and policy, which is crucial for reliable communication and information retrieval. Its contribution is incremental, however, because it applies existing methods to a new dataset.

The authors evaluate how well LLMs assess human expert confidence in climate statements. They introduce the ClimateX dataset of 8,094 expert-labeled statements and find that LLMs reach up to 47% classification accuracy but are significantly over-confident on low and medium confidence statements.

Evaluating the accuracy of outputs generated by Large Language Models (LLMs) is especially important in the climate science and policy domain. We introduce the Expert Confidence in Climate Statements (ClimateX) dataset, a novel, curated, expert-labeled dataset consisting of 8094 climate statements collected from the latest Intergovernmental Panel on Climate Change (IPCC) reports, labeled with their associated confidence levels. Using this dataset, we show that recent LLMs can classify human expert confidence in climate-related statements, especially in a few-shot learning setting, but with limited (up to 47%) accuracy. Overall, models exhibit consistent and significant over-confidence on low and medium confidence statements. We highlight implications of our results for climate communication, LLM evaluation strategies, and the use of LLMs in information retrieval systems.
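The two headline measurements in the abstract, classification accuracy and over-confidence on lower-confidence statements, can be illustrated with a minimal sketch. This is not the authors' evaluation code. It assumes a four-level ordinal scale (low, medium, high, very high), of which the text above names only "low" and "medium", and it measures over-confidence as the mean signed gap between the predicted and expert labels. The function name `evaluate` and the toy labels are hypothetical and are not drawn from ClimateX.

```python
from collections import defaultdict

# Assumed ordinal scale; the summary above only names "low" and "medium".
LEVELS = ["low", "medium", "high", "very high"]
RANK = {name: i for i, name in enumerate(LEVELS)}


def evaluate(true_labels, predicted_labels):
    """Return overall accuracy and the mean signed error per true class.

    A positive signed error means predictions sit above the expert label
    on the ordinal scale, i.e. the model is over-confident for that class.
    """
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("label lists must not be empty")

    correct = 0
    errors_by_class = defaultdict(list)
    for truth, pred in zip(true_labels, predicted_labels):
        correct += truth == pred
        errors_by_class[truth].append(RANK[pred] - RANK[truth])

    accuracy = correct / len(true_labels)
    bias = {label: sum(errs) / len(errs) for label, errs in errors_by_class.items()}
    return accuracy, bias


# Toy, made-up labels for illustration only (not drawn from ClimateX).
truth = ["low", "low", "medium", "medium", "high", "very high"]
preds = ["medium", "high", "high", "medium", "high", "very high"]
accuracy, bias = evaluate(truth, preds)
print(accuracy, bias)
```

Reporting the signed error per true class, rather than a single aggregate, is what makes a pattern such as "over-confident on low and medium statements" visible: an aggregate score would hide it behind correct predictions on the higher classes.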

Code Implementations: 1 repo