Are Large Language Models Consistent over Value-laden Questions?
This paper addresses the problem of assessing whether LLMs are reliable enough to simulate human values, which matters for AI ethics and deployment. The contribution is incremental: it builds on prior claims of inconsistency by introducing new measures.
The study investigates whether large language models (LLMs) answer value-laden questions consistently, using 8,000 questions spanning more than 300 topics. It finds that models are relatively consistent across paraphrases, use-cases, translations, and within topics. Base models are more consistent than fine-tuned ones, and consistency varies by topic.
Large language models (LLMs) appear to bias their survey answers toward certain values. Nonetheless, some argue that LLMs are too inconsistent to simulate particular values. Are they? To answer, we first define value consistency as the similarity of answers across (1) paraphrases of one question, (2) related questions under one topic, (3) multiple-choice and open-ended use-cases of one question, and (4) multilingual translations of a question into English, Chinese, German, and Japanese. We apply these measures to small and large, open LLMs including llama-3, as well as gpt-4o, using 8,000 questions spanning more than 300 topics. Unlike prior work, we find that models are relatively consistent across paraphrases, use-cases, translations, and within a topic. Still, some inconsistencies remain. Models are more consistent on uncontroversial topics (e.g., in the U.S., "Thanksgiving") than on controversial ones ("euthanasia"). Base models are both more consistent than fine-tuned models and uniform in their consistency across topics, while fine-tuned models are more inconsistent about some topics ("euthanasia") than others ("women's rights"), like our human subjects (n=165).
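The four consistency measures defined above can be illustrated with a minimal sketch. The abstract does not state which similarity metric is used, so this example assumes, purely for illustration, that each answer has been reduced to a categorical stance label and that consistency is the fraction of answer pairs that agree. The function name `pairwise_agreement`, the stance labels, and the sample answers are all hypothetical and are not taken from the paper.

```python
from itertools import combinations


def pairwise_agreement(answers):
    """Return the fraction of answer pairs that match (1.0 = fully consistent).

    Assumption: answers are categorical stance labels. This is a stand-in
    for whatever similarity measure the paper actually uses.
    """
    pairs = list(combinations(answers, 2))
    if not pairs:
        return 1.0  # zero or one answer: nothing to disagree with
    return sum(a == b for a, b in pairs) / len(pairs)


# Hypothetical answers from one model, grouped by the four measures.
responses = {
    # (1) four paraphrases of one question
    "paraphrase": ["support", "support", "support", "oppose"],
    # (2) three related questions under one topic
    "topic": ["support", "support", "support"],
    # (3) multiple-choice vs. open-ended use-case of one question
    "use_case": ["support", "support"],
    # (4) one question in English, Chinese, German, and Japanese
    "translation": ["support", "oppose", "support", "support"],
}

scores = {measure: pairwise_agreement(a) for measure, a in responses.items()}

for measure, score in scores.items():
    print(f"{measure}: {score:.2f}")
```

Under this assumption, a score near 1.0 on every measure would correspond to the "relatively consistent" behavior the abstract reports, while a controversial topic would be expected to show lower scores than an uncontroversial one. A distribution-based distance over answer probabilities would be a natural alternative where such probabilities are available.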