CL · Feb 2, 2025

The Accuracy, Robustness, and Readability of LLM-Generated Sustainability-Related Word Definitions

arXiv:2502.00916v1 · 12 citations · h-index: 1
Originality: Synthesis-oriented
AI Analysis

This work addresses the problem of keeping terminology in climate discussions accurate and clear for researchers and policymakers. It is incremental: it evaluates existing LLMs on a specific dataset without introducing new methods.

The study compared LLM-generated definitions of sustainability terms with official IPCC glossary definitions. GPT-4o-mini, Llama3.1 8B, and Mistral 7B reached average adherence scores of 0.57-0.59 ± 0.15 and produced definitions that were harder to read than the originals. The generated definitions varied mainly for terms with multiple or ambiguous definitions.

A common language with standardized definitions is crucial for effective climate discussions. However, concerns exist about LLMs misrepresenting climate terms. We compared 300 official IPCC glossary definitions with those generated by GPT-4o-mini, Llama3.1 8B, and Mistral 7B, analyzing adherence, robustness, and readability using SBERT sentence embeddings. The LLMs scored an average adherence of $0.57-0.59 \pm 0.15$, and their definitions proved harder to read than the originals. Model-generated definitions vary mainly among words with multiple or ambiguous definitions, showing the potential to highlight terms that need standardization. The results show how LLMs could support environmental discourse while emphasizing the need to align model outputs with established terminology for clarity and consistency.
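As a rough illustration of how an embedding-based adherence score could be computed, the sketch below compares each reference definition with a generated one and reports the mean and standard deviation over all pairs. Several things here are assumptions rather than details from the paper: cosine similarity as the comparison measure, the function names `embed`, `cosine`, and `adherence_summary`, and the example sentences, which are made up and not IPCC text. The `embed` function is a toy bag-of-words stand-in so the example runs with the standard library alone; it is not SBERT, and a real pipeline would replace it with sentence embeddings.

```python
import math
from collections import Counter
from statistics import mean, stdev


def embed(text):
    """Toy stand-in for a sentence embedding: a bag-of-words count vector.

    Assumption for illustration only; the paper uses SBERT embeddings.
    """
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


def adherence_summary(pairs):
    """Return (mean, sample standard deviation) of similarity over pairs.

    `pairs` is a list of (reference_definition, generated_definition)
    tuples and must contain at least two items for stdev to be defined.
    """
    scores = [cosine(embed(ref), embed(gen)) for ref, gen in pairs]
    return mean(scores), stdev(scores)


if __name__ == "__main__":
    # Hypothetical definitions, invented for this example.
    pairs = [
        ("the process of adjusting to actual or expected climate",
         "adjusting systems to actual or expected climate effects"),
        ("a reduction in emissions of greenhouse gases",
         "actions that limit the release of greenhouse gases"),
    ]
    avg, spread = adherence_summary(pairs)
    print(f"adherence: {avg:.2f} +/- {spread:.2f}")
```

Scores from this word-overlap stand-in are not comparable to the 0.57-0.59 range reported above, since sentence embeddings also capture paraphrase and meaning rather than shared tokens only.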

