CL

Apr 4, 2025

CliME: Evaluating Multimodal Climate Discourse on Social Media and the Climate Alignment Quotient (CAQ)

Abhilekh Borah, Hasnat Md Abdullah, Kangda Wei, Ruihong Huang

arXiv:2504.03906v16.72 citations

h-index: 4

Proceedings of the Fourth Workshop on NLP for Positive Impact (NLP4PI)

Originality: Highly original

AI Analysis

This work addresses the need to assess whether LLMs amplify credible climate solutions or spread unsubstantiated claims, a question that matters to researchers and policymakers dealing with climate misinformation on social media.

The paper evaluates how well Large Language Models understand multimodal climate discourse on social media. It introduces CliME, a first-of-its-kind dataset of 2579 Twitter and Reddit posts, and the Climate Alignment Quotient (CAQ) metric. The results show that while most LLMs perform relatively well on Criticality and Justice, they consistently underperform on Actionability, with Claude 3.7 Sonnet achieving the highest overall performance.

The rise of Large Language Models (LLMs) has raised questions about their ability to understand climate-related contexts. Though climate change dominates social media, analyzing its multimodal expressions is understudied, and current tools have failed to determine whether LLMs amplify credible solutions or spread unsubstantiated claims. To address this, we introduce CliME (Climate Change Multimodal Evaluation), a first-of-its-kind multimodal dataset, comprising 2579 Twitter and Reddit posts. The benchmark features a diverse collection of humorous memes and skeptical posts, capturing how these formats distill complex issues into viral narratives that shape public opinion and policy discussions. To systematically evaluate LLM performance, we present the Climate Alignment Quotient (CAQ), a novel metric comprising five distinct dimensions: Articulation, Evidence, Resonance, Transition, and Specificity. Additionally, we propose three analytical lenses: Actionability, Criticality, and Justice, to guide the assessment of LLM-generated climate discourse using CAQ. Our findings, based on the CAQ metric, indicate that while most evaluated LLMs perform relatively well in Criticality and Justice, they consistently underperform on the Actionability axis. Among the models evaluated, Claude 3.7 Sonnet achieves the highest overall performance. We publicly release our CliME dataset and code to foster further research in this domain.
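The abstract names the five CAQ dimensions and the three analytical lenses but does not say how the dimension scores are combined. The sketch below shows one way such per-lens scores could be organized and compared. It assumes each dimension is scored in [0, 1] and combined by a (by default unweighted) mean. The class `DimensionScores`, the helper `weakest_lens`, the weighting scheme, and all numbers are hypothetical illustrations, not the paper's formula or results.

```python
from dataclasses import dataclass
from statistics import mean

# Dimension and lens names are taken from the abstract.
DIMENSIONS = ("articulation", "evidence", "resonance", "transition", "specificity")
LENSES = ("actionability", "criticality", "justice")


@dataclass(frozen=True)
class DimensionScores:
    """Hypothetical container: one score in [0, 1] per CAQ dimension."""
    articulation: float
    evidence: float
    resonance: float
    transition: float
    specificity: float

    def aggregate(self, weights=None):
        """Combine the five scores into one number.

        ASSUMPTION: a weighted mean (unweighted when `weights` is None).
        The abstract does not give the actual CAQ aggregation.
        """
        values = [getattr(self, name) for name in DIMENSIONS]
        if weights is None:
            return mean(values)
        total = sum(weights[name] for name in DIMENSIONS)
        return sum(weights[name] * v for name, v in zip(DIMENSIONS, values)) / total


def weakest_lens(scores_by_lens):
    """Return the lens whose aggregated score is lowest."""
    return min(scores_by_lens, key=lambda lens: scores_by_lens[lens].aggregate())


# Placeholder scores for one imaginary model, chosen only to exercise the code.
example = {
    "actionability": DimensionScores(0.4, 0.3, 0.5, 0.2, 0.1),
    "criticality": DimensionScores(0.7, 0.6, 0.8, 0.7, 0.7),
    "justice": DimensionScores(0.6, 0.7, 0.7, 0.6, 0.9),
}

for lens in LENSES:
    print(f"{lens}: {example[lens].aggregate():.2f}")
print("weakest lens:", weakest_lens(example))
```

Keeping one `DimensionScores` record per lens makes it straightforward to report both a per-lens summary and the individual dimensions behind it, which is the kind of comparison the abstract describes when it contrasts Actionability with Criticality and Justice.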
