CL · CY · Sep 16, 2024

Gaps or Hallucinations? Gazing into Machine-Generated Legal Analysis for Fine-grained Text Evaluations

Abe Bohan Hou, William Jurayj, Nils Holzenberger, Andrew Blair-Stanek, Benjamin Van Durme

arXiv:2409.09947v23.43 citations · h-index: 16 · Has Code

Originality: Incremental advance

AI Analysis

This paper addresses the challenge of reliably evaluating LLM-generated legal text, which is crucial for professionals who use AI writing aids. The advance is incremental, since it builds on existing tasks and datasets.

The paper tackles the evaluation of machine-generated legal analysis by introducing 'gaps', a neutral term for differences between human-written and machine-generated analysis that, unlike hallucinations in the strict sense, are not always errors. It develops a fine-grained detector that achieves 67% F1 score and 80% precision on the test set, and finds that around 80% of analyses generated by SOTA LLMs contain hallucinations.

Large Language Models (LLMs) show promise as a writing aid for professionals performing legal analyses. However, LLMs can often hallucinate in this setting, in ways difficult to recognize by non-professionals and existing text evaluation metrics. In this work, we pose the question: when can machine-generated legal analysis be evaluated as acceptable? We introduce the neutral notion of gaps, as opposed to hallucinations in a strict erroneous sense, to refer to the difference between human-written and machine-generated legal analysis. Gaps do not always equate to invalid generation. Working with legal experts, we consider the CLERC generation task proposed in Hou et al. (2024b), leading to a taxonomy, a fine-grained detector for predicting gap categories, and an annotated dataset for automatic evaluation. Our best detector achieves 67% F1 score and 80% precision on the test set. Employing this detector as an automated metric on legal analysis generated by SOTA LLMs, we find around 80% contain hallucinations of different kinds.
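For readers less familiar with the reported metrics, here is a minimal sketch of how precision, recall, and F1 are computed for one label from reference and predicted annotations. The label names (`gap`, `no_gap`) and the toy data are hypothetical, not drawn from the paper's taxonomy or dataset, and the abstract does not state how the paper averages scores across gap categories.

```python
def precision_recall_f1(gold, pred, positive):
    """Compute precision, recall, and F1 for a single positive label."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)

    # Guard against division by zero when a label is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1


# Hypothetical toy annotations: one label per generated analysis.
gold = ["gap", "gap", "no_gap", "gap", "no_gap", "gap"]
pred = ["gap", "no_gap", "no_gap", "gap", "gap", "gap"]

p, r, f1 = precision_recall_f1(gold, pred, positive="gap")
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

High precision with a lower F1, as reported for the detector, means its positive predictions are usually correct but it misses a share of the true cases.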
