CL, Feb 2, 2025

The Accuracy, Robustness, and Readability of LLM-Generated Sustainability-Related Word Definitions

arXiv:2502.00916v117.012 citations h-index: 1

Originality: Synthesis-oriented

AI Analysis

This work addresses the need for accurate, clear terminology in climate discussions among researchers and policymakers. Its contribution is incremental: it evaluates existing LLMs on a specific dataset without introducing new methods.

The study compared LLM-generated definitions of sustainability terms with official IPCC glossary definitions. GPT-4o-mini, Llama3.1 8B, and Mistral 7B reached average adherence scores of 0.57-0.59 ± 0.15 and produced definitions that were harder to read than the originals. The generated definitions varied mainly for terms with multiple or ambiguous definitions.

A common language with standardized definitions is crucial for effective climate discussions. However, concerns exist about LLMs misrepresenting climate terms. We compared 300 official IPCC glossary definitions with those generated by GPT-4o-mini, Llama3.1 8B, and Mistral 7B, analyzing adherence, robustness, and readability using SBERT sentence embeddings. The LLMs scored an average adherence of $0.57-0.59 \pm 0.15$, and their definitions proved harder to read than the originals. Model-generated definitions vary mainly among words with multiple or ambiguous definitions, showing the potential to highlight terms that need standardization. The results show how LLMs could support environmental discourse while emphasizing the need to align model outputs with established terminology for clarity and consistency.
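The adherence comparison described in the abstract can be illustrated with a minimal sketch. It assumes that adherence is the cosine similarity between the embedding of an official definition and the embedding of a generated one, which the abstract does not state explicitly. To stay runnable without downloading a model, the sketch uses a hypothetical bag-of-words `toy_embed` function in place of SBERT; the example definitions are invented for illustration and are not taken from the IPCC glossary or from any model output.

```python
import math
import statistics
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for an SBERT encoder: a bag-of-words count vector.

    A real pipeline would replace this with sentence embeddings.
    """
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def adherence_scores(reference: dict, generated: dict) -> dict:
    """Score each term present in both glossaries (assumed adherence metric)."""
    return {
        term: cosine_similarity(toy_embed(reference[term]), toy_embed(generated[term]))
        for term in reference
        if term in generated
    }


def summarize(scores: dict) -> tuple:
    """Return (mean, sample standard deviation) of the per-term scores."""
    values = list(scores.values())
    mean = statistics.mean(values)
    std = statistics.stdev(values) if len(values) > 1 else 0.0
    return mean, std


if __name__ == "__main__":
    # Invented placeholder definitions, for illustration only.
    reference = {
        "mitigation": "human intervention to reduce emissions or enhance sinks",
        "adaptation": "adjustment to actual or expected climate and its effects",
    }
    generated = {
        "mitigation": "actions taken to reduce greenhouse gas emissions",
        "adaptation": "changes made to cope with the effects of climate change",
    }
    mean, std = summarize(adherence_scores(reference, generated))
    print(f"adherence: {mean:.2f} +/- {std:.2f}")
```

Swapping `toy_embed` for a dense sentence encoder would require only a cosine function over dense vectors; the aggregation into a mean and a standard deviation stays the same.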
