CL · AI · May 17

Validate Your Authority: Benchmarking LLMs on Multi-Label Precedent Treatment Classification

arXiv:2605.17691 · 86.41 citations
Predicted impact top 46% in CL · last 90 days
Originality: Incremental advance
AI Analysis

For legal NLP practitioners, this work provides a more robust evaluation framework and baseline for a high-stakes classification task.

This paper introduces a new expert-annotated dataset of 239 legal citations and a novel Average Severity Error metric to benchmark LLMs on multi-label precedent treatment classification. Gemini 2.5 Flash achieved 79.1% accuracy on the high-level classification task, while GPT-5-mini reached 67.7% on the fine-grained schema.

Automating the classification of negative treatment in legal precedent is a critical yet nuanced NLP task where misclassification carries significant risk. To address the shortcomings of standard accuracy, this paper introduces a more robust evaluation framework. We benchmark modern Large Language Models on a new, expert-annotated dataset of 239 real-world legal citations and propose a novel Average Severity Error metric to better measure the practical impact of classification errors. Our experiments reveal a performance split. Google's Gemini 2.5 Flash achieved the highest accuracy on a high-level classification task (79.1%), while OpenAI's GPT-5-mini was the top performer on the more complex fine-grained schema (67.7%). This work establishes a crucial baseline, provides a new context-rich dataset, and introduces an evaluation metric tailored to the demands of this complex legal reasoning task.
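The abstract names the Average Severity Error metric but does not define it here. The sketch below is therefore only one plausible reading, not the authors' formula: it assumes each treatment label has an ordinal severity rank, scores a multi-label annotation by its most severe label, and averages the absolute rank gap between gold and predicted annotations. The label names, the ranks in `SEVERITY`, and both function names are hypothetical.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical severity ranks (assumption): the paper's actual label
# schema and weights are not given in this summary.
SEVERITY: Dict[str, int] = {
    "followed": 0,
    "distinguished": 1,
    "criticized": 2,
    "overruled": 3,
}


def set_severity(labels: Iterable[str]) -> int:
    """Severity of a multi-label annotation: its most severe label (0 if empty)."""
    return max((SEVERITY[label] for label in labels), default=0)


def average_severity_error(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Mean absolute gap in severity rank between gold and predicted label sets.

    Assumed definition for illustration only, not the paper's.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        raise ValueError("at least one example is required")
    total = sum(abs(set_severity(g) - set_severity(p)) for g, p in zip(gold, pred))
    return total / len(gold)


gold = [{"overruled"}, {"followed"}, {"distinguished", "criticized"}]
pred = [{"followed"}, {"followed"}, {"criticized"}]
print(average_severity_error(gold, pred))
```

Under this reading, mistaking an overruled case for a followed one costs three rank steps, whereas exact-match accuracy would count it the same as any other miss. That difference is the kind of practical impact the abstract says plain accuracy fails to capture.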

