Not ready for the bench: LLM legal interpretation is unstable and out of step with human judgments
The finding matters for legal practitioners and scholars who rely on AI: it indicates that LLMs are not ready for real-world legal interpretation tasks. The contribution is incremental, but it is an important caution.
The study finds that large language models (LLMs) produce unstable legal interpretations: their conclusions vary widely with question format, and their judgments correlate only weakly to moderately with human judgments. Both results indicate that LLMs are unreliable for legal practice.
Legal interpretation frequently involves assessing how a legal text, as understood by an 'ordinary' speaker of the language, applies to the set of facts characterizing a legal dispute in the U.S. judicial system. Recent scholarship has proposed that legal practitioners add large language models (LLMs) to their interpretive toolkit. This work offers an empirical argument against LLM interpretation as recently practiced by legal scholars and federal judges. Our investigation in English shows that models do not provide stable interpretive judgments: varying the question format can lead the model to wildly different conclusions. Moreover, the models show weak to moderate correlation with human judgment, with large variance across model and question variant, suggesting that it is dangerous to give much credence to the conclusions produced by generative AI.
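The two quantities the abstract refers to, instability across question formats and correlation with human judgment, can be illustrated with a minimal sketch. It is not the study's actual pipeline. The function names, the variant labels (`yes_no`, `agreement_scale`), the 0-to-1 score scale, and all numbers are hypothetical placeholders chosen for illustration only.

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def variant_spread(scores_by_variant):
    """Per-item range (max - min) of model scores across question variants.

    A large range means the model's answer for that item depends heavily
    on how the question was phrased.
    """
    per_item = zip(*scores_by_variant.values())
    return [max(scores) - min(scores) for scores in per_item]


# Hypothetical placeholder data (NOT results from the study): one score per
# interpretive item, on an assumed 0-to-1 scale.
HUMAN = [0.9, 0.2, 0.6, 0.4]
MODEL = {
    "yes_no": [1.0, 0.0, 1.0, 0.0],
    "agreement_scale": [0.5, 0.5, 0.25, 0.75],
}

if __name__ == "__main__":
    for variant, scores in MODEL.items():
        print(variant, "correlation with human:", round(pearson(HUMAN, scores), 3))
    print("per-item spread across variants:", variant_spread(MODEL))
```

In a sketch like this, a stable model would show per-item spreads near zero and similar correlations for every variant. Large spreads, or correlations that change sharply from one variant to another, correspond to the kind of instability the abstract describes.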