CL, AI, CY | Jan 14, 2025

Potential and Perils of Large Language Models as Judges of Unstructured Textual Data

arXiv:2501.08167v2 | 16 citations | h-index: 13
AI Analysis

This paper addresses the problem of scalable evaluation of AI-generated text summaries for organizations that use LLMs to analyze unstructured data, though its contribution is incremental, as it validates existing LLM-as-judge approaches.

This research investigated whether large language models (LLMs) can reliably evaluate thematic summaries of unstructured text data such as survey responses. An Anthropic Claude model generated the summaries, and LLM judges (Amazon's Titan Express, Nova Pro, and Meta's Llama) were compared to human evaluations using Cohen's kappa, Spearman's rho, and Krippendorff's alpha. The results showed that LLM-as-judge models offer a scalable solution comparable to human raters, but humans may still be better at detecting subtle, context-specific nuances.

Rapid advancements in large language models have unlocked remarkable capabilities for processing and summarizing unstructured text data. This has implications for the analysis of rich, open-ended datasets, such as survey responses, where LLMs hold the promise of efficiently distilling key themes and sentiments. However, as organizations increasingly turn to these powerful AI systems to make sense of textual feedback, a critical question arises: can we trust LLMs to accurately represent the perspectives contained within these text-based datasets? While LLMs excel at generating human-like summaries, there is a risk that their outputs may inadvertently diverge from the true substance of the original responses. Discrepancies between the LLM-generated outputs and the actual themes present in the data could lead to flawed decision-making, with far-reaching consequences for organizations. This research investigates the effectiveness of LLM-as-judge models in evaluating the thematic alignment of summaries generated by other LLMs. We utilized an Anthropic Claude model to generate thematic summaries from open-ended survey responses, with Amazon's Titan Express, Nova Pro, and Meta's Llama serving as judges. This LLM-as-judge approach was compared to human evaluations using Cohen's kappa, Spearman's rho, and Krippendorff's alpha, validating a scalable alternative to traditional human-centric evaluation methods. Our findings reveal that while LLM-as-judge models offer a scalable solution comparable to human raters, humans may still excel at detecting subtle, context-specific nuances. Our research contributes to the growing body of knowledge on AI-assisted text analysis. Further, we provide recommendations for future research, emphasizing the need for careful consideration when generalizing LLM-as-judge models across various contexts and use cases.
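As an illustration of how two of the agreement metrics named above can be computed, here is a minimal Python sketch. It is not the authors' code: the function names and the rating lists are hypothetical, the 1-to-3 alignment scale is an assumption, and Krippendorff's alpha is left out for brevity. Cohen's kappa treats the ratings as categorical labels and corrects raw agreement for chance; Spearman's rho is computed as the Pearson correlation of average ranks, which handles tied ratings.

```python
from collections import Counter
from math import sqrt


def cohen_kappa(a, b):
    """Cohen's kappa for two raters labelling the same items (nominal labels)."""
    if len(a) != len(b) or not a:
        raise ValueError("rating lists must be non-empty and of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: product of each rater's marginal label frequencies.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(a, b):
    """Spearman's rho as the Pearson correlation of the average ranks."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    ra, rb = average_ranks(a), average_ranks(b)
    n = len(ra)
    mean_a, mean_b = sum(ra) / n, sum(rb) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    if var_a == 0 or var_b == 0:
        raise ValueError("rho is undefined when a rater gives constant ratings")
    return cov / sqrt(var_a * var_b)


# Hypothetical ratings (assumed scale: 1 = poor alignment, 3 = strong alignment).
# These numbers are invented for illustration and are not from the paper.
human_ratings = [1, 2, 3, 3, 2, 1, 3, 2]
judge_ratings = [1, 2, 3, 2, 2, 1, 3, 3]

print("Cohen's kappa:", round(cohen_kappa(human_ratings, judge_ratings), 3))
print("Spearman's rho:", round(spearman_rho(human_ratings, judge_ratings), 3))
```

Reporting both is useful because they answer different questions: kappa asks whether the judge and the human pick exactly the same score more often than chance, while rho asks only whether they order the summaries similarly.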
