CL | CY | Jan 28, 2025

LLMs Provide Unstable Answers to Legal Questions

arXiv:2502.05196v1 | 17 citations | h-index: 8 | ICAIL
Originality: Synthesis-oriented
AI Analysis

The paper highlights a critical reliability issue for legal AI products, legal processes, and professionals who rely on LLMs. Its contribution is incremental, though: it documents the instability rather than proposing a solution.

The study found that leading LLMs such as GPT-4o, Claude-3.5, and Gemini-1.5 are unstable: they give inconsistent answers to identical hard legal questions even with temperature set to 0. The authors demonstrate this on a novel dataset of 500 legal questions distilled from real cases.

An LLM is stable if it reaches the same conclusion when asked the identical question multiple times. We find leading LLMs like gpt-4o, claude-3.5, and gemini-1.5 are unstable when providing answers to hard legal questions, even when made as deterministic as possible by setting temperature to 0. We curate and release a novel dataset of 500 legal questions distilled from real cases, involving two parties, with facts, competing legal arguments, and the question of which party should prevail. When provided the exact same question, we observe that LLMs sometimes say one party should win, while other times saying the other party should win. This instability has implications for the increasing numbers of legal AI products, legal processes, and lawyers relying on these LLMs.
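
The stability definition above can be illustrated with a minimal sketch. It is not the authors' evaluation code: the function names, the number of runs, and the stub model are all hypothetical, chosen only to show what "same conclusion on every repeat" means in practice. A real check would replace the stub with a call to an actual model at temperature 0.

```python
from collections import Counter
from typing import Callable, List


def collect_answers(ask: Callable[[str], str], question: str, runs: int) -> List[str]:
    """Ask the identical question `runs` times and return every verdict."""
    return [ask(question) for _ in range(runs)]


def is_stable(answers: List[str]) -> bool:
    """Stable means every repeat reached the same conclusion."""
    if not answers:
        raise ValueError("need at least one answer")
    return len(set(answers)) == 1


def majority_share(answers: List[str]) -> float:
    """Fraction of runs that agree with the most common verdict."""
    if not answers:
        raise ValueError("need at least one answer")
    top_count = Counter(answers).most_common(1)[0][1]
    return top_count / len(answers)


def make_flaky_stub(flip_every: int) -> Callable[[str], str]:
    """Hypothetical stand-in for a model: sides with Party 2 on every
    `flip_every`-th call and with Party 1 otherwise."""
    calls = {"n": 0}

    def ask(question: str) -> str:
        calls["n"] += 1
        return "Party 2" if calls["n"] % flip_every == 0 else "Party 1"

    return ask


if __name__ == "__main__":
    answers = collect_answers(make_flaky_stub(4), "Which party should prevail?", runs=8)
    print(is_stable(answers), majority_share(answers))
```

Here the stub flips its verdict on two of eight calls, so the question counts as unstable even though three quarters of the runs agree. The majority share is one possible way to grade how unstable a question is. It is an assumption of this sketch, not a metric taken from the paper.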
