CL HC Oct 27, 2023

DELPHI: Data for Evaluating LLMs' Performance in Handling Controversial Issues

arXiv:2310.18130v2136 citations
h-index: 13
Originality Synthesis-oriented
AI Analysis

This work addresses the problem of assessing LLMs' performance on controversial issues for researchers and developers, but it is incremental as it builds upon an existing dataset.

The authors tackled the lack of datasets for evaluating how large language models (LLMs) handle controversial issues by constructing a novel dataset from the Quora Question Pairs Dataset, which includes human-annotated labels for contemporary debates. They evaluated various LLMs on this dataset to analyze their responses and stances on controversial topics.

Controversy is a reflection of our zeitgeist and an important aspect of any discourse. The rise of large language models (LLMs) as conversational systems has increased public reliance on these systems for answers to a wide range of questions. Consequently, it is crucial to systematically examine how these models respond to questions pertaining to ongoing debates. However, few datasets exist that provide human-annotated labels reflecting contemporary discussions. To foster research in this area, we propose a novel construction of a controversial questions dataset, expanding upon the publicly released Quora Question Pairs Dataset. This dataset presents challenges concerning knowledge recency, safety, fairness, and bias. We evaluate different LLMs using a subset of this dataset, illuminating how they handle controversial issues and the stances they adopt. This research ultimately contributes to our understanding of LLMs' interaction with controversial issues, paving the way for improvements in their comprehension and handling of complex societal debates.

Code Implementations: 2 repos