CL · AI · May 17

Validate Your Authority: Benchmarking LLMs on Multi-Label Precedent Treatment Classification

arXiv:2605.17691 · 86.41 citations

Predicted impact top 46% in CL · last 90 days · Originality Incremental advance

AI Analysis

For legal NLP practitioners, this work provides a more robust evaluation framework and baseline for a high-stakes classification task.

This paper introduces a new expert-annotated dataset of 239 legal citations and a novel Average Severity Error metric to benchmark LLMs on multi-label precedent treatment classification. Gemini 2.5 Flash achieved 79.1% accuracy on the high-level classification task, while GPT-5-mini reached 67.7% on the fine-grained schema.

Automating the classification of negative treatment in legal precedent is a critical yet nuanced NLP task where misclassification carries significant risk. To address the shortcomings of standard accuracy, this paper introduces a more robust evaluation framework. We benchmark modern Large Language Models on a new, expert-annotated dataset of 239 real-world legal citations and propose a novel Average Severity Error metric to better measure the practical impact of classification errors. Our experiments reveal a performance split. Google's Gemini 2.5 Flash achieved the highest accuracy on a high-level classification task (79.1%), while OpenAI's GPT-5-mini was the top performer on the more complex fine-grained schema (67.7%). This work establishes a crucial baseline, provides a new context-rich dataset, and introduces an evaluation metric tailored to the demands of this complex legal reasoning task.
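The abstract names the Average Severity Error metric but does not define it here. The sketch below is one plausible reading, not the paper's formula: it assumes each treatment label has an ordinal severity score, that a multi-label annotation is scored by its most severe label, and that the error is the mean absolute severity gap between gold and predicted annotations. The label names, the scores in `SEVERITY`, and both function names are hypothetical.

```python
from typing import Dict, List, Set

# Hypothetical severity ranking. These labels and scores are illustrative
# assumptions, not the schema used in the paper.
SEVERITY: Dict[str, int] = {
    "followed": 0,
    "distinguished": 1,
    "criticized": 2,
    "overruled": 3,
}


def max_severity(labels: Set[str]) -> int:
    """Score a label set by its most severe label (0 if the set is empty)."""
    return max((SEVERITY[label] for label in labels), default=0)


def average_severity_error(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Mean absolute gap in severity between gold and predicted label sets."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        raise ValueError("at least one example is required")
    total = sum(abs(max_severity(g) - max_severity(p)) for g, p in zip(gold, pred))
    return total / len(gold)


def exact_match_accuracy(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Fraction of examples whose predicted label set equals the gold set."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


# Two hypothetical models, each wrong on the same single citation.
gold = [{"overruled"}, {"distinguished"}, {"followed"}]
pred_a = [{"followed"}, {"distinguished"}, {"followed"}]    # misses an overruling entirely
pred_b = [{"criticized"}, {"distinguished"}, {"followed"}]  # flags it, but too mildly

print(exact_match_accuracy(gold, pred_a), average_severity_error(gold, pred_a))
print(exact_match_accuracy(gold, pred_b), average_severity_error(gold, pred_b))
```

Under these assumptions both models have the same exact-match accuracy, yet the model that reads an overruled case as followed gets a severity error three times larger. That is the kind of distinction plain accuracy cannot make.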
