Biases in Large Language Model-Elicited Text: A Case Study in Natural Language Inference
This work addresses biases in LLM-generated datasets, a concern for NLP researchers and practitioners. It shows that LLM-generated data inherits issues similar to those found in human-annotated data, an incremental but important finding for dataset creation.
The study tested whether NLP datasets created with Large Language Models (LLMs) contain annotation artifacts and social biases similar to crowd-sourced datasets, using GPT-4, Llama-2 70b, and Mistral 7b to recreate a portion of the Stanford Natural Language Inference corpus. Fine-tuned BERT hypothesis-only classifiers achieved 86-96% accuracy on LLM-generated datasets, and analyses characterized the specific annotation artifacts and stereotypical biases present.
We test whether NLP datasets created with Large Language Models (LLMs) contain annotation artifacts and social biases like those found in NLP datasets elicited from crowdsourced workers. We recreate a portion of the Stanford Natural Language Inference corpus using GPT-4, Llama-2 70b for Chat, and Mistral 7b Instruct. We train hypothesis-only classifiers to determine whether LLM-elicited NLI datasets contain annotation artifacts. Next, we use pointwise mutual information to identify the words in each dataset that are associated with gender, race, and age-related terms. On our LLM-generated NLI datasets, fine-tuned BERT hypothesis-only classifiers achieve between 86% and 96% accuracy. Our analyses further characterize the annotation artifacts and stereotypical biases in LLM-generated datasets.
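To make the pointwise mutual information step concrete, here is a minimal sketch of how word associations with a set of target terms could be scored. It is an illustration under stated assumptions, not the procedure used in the study: the function name `pmi_with_terms`, the whitespace tokenization, and the choice to count co-occurrence at the sentence level are all hypothetical.

```python
import math
from collections import Counter


def pmi_with_terms(sentences, target_terms):
    """Score each word by its PMI with the event 'sentence contains a target term'.

    Assumption: probabilities are estimated from sentence-level counts, and
    tokens are produced by lowercasing and splitting on whitespace.
    """
    targets = {t.lower() for t in target_terms}
    n = len(sentences)
    word_count = Counter()   # number of sentences containing each word
    joint_count = Counter()  # number of sentences containing the word and a target term
    n_target = 0             # number of sentences containing any target term

    for sentence in sentences:
        tokens = set(sentence.lower().split())
        has_target = bool(tokens & targets)
        n_target += has_target
        for word in tokens - targets:
            word_count[word] += 1
            if has_target:
                joint_count[word] += 1

    scores = {}
    for word, joint in joint_count.items():
        # PMI(w, t) = log2( P(w, t) / (P(w) * P(t)) )
        p_joint = joint / n
        p_word = word_count[word] / n
        p_target = n_target / n
        scores[word] = math.log2(p_joint / (p_word * p_target))
    return scores
```

Words that never co-occur with a target term are omitted rather than assigned negative infinity. In practice, low-frequency words would likely need a minimum-count filter or smoothing, since PMI tends to overrate rare words.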