CL · AI · May 27

UA-Legal-Bench: A Benchmark for Evaluating Large Language Models on Ukrainian Legal Reasoning

arXiv:2605.29170 · 70.7 · h-index: 1
Predicted impact: top 90% in CL · last 90 days
Originality: Incremental advance
AI Analysis

For NLP and legal AI researchers, this benchmark exposes failure modes of LLMs in a morphologically rich, non-English legal domain and highlights the need for imbalance-aware evaluation metrics, such as macro-F1, on imbalanced tasks.

UA-Legal-Bench introduces a five-task benchmark for evaluating LLMs on Ukrainian legal reasoning. It finds that few-shot prompting improves judgment form classification by up to +38.6 percentage points but has mixed effects on outcome prediction. It also finds that accuracy is misleading on imbalanced tasks: the model with 62% accuracy on case-outcome prediction (COP) is a majority-class predictor with 23% macro-F1.

Legal NLP benchmarks are overwhelmingly English-centric, leaving failure modes in morphologically rich, non-Latin-script languages undetected. We introduce UA-Legal-Bench, a five-task benchmark for evaluating large language models on Ukrainian legal reasoning, built from the Unified State Register of Court Decisions (EDRSR), one of the world's largest open judicial corpora (99.5 million decisions). The benchmark comprises: (1) case-type classification (4 classes, n=2,000), (2) judgment form classification (4 classes, n=2,000), (3) case-outcome prediction (COP; 6 classes, n=800), (4) legal norm extraction (n=1,794), and (5) cause category prediction (22 classes, n=1,871). We evaluate 11 LLMs (3B--675B) from five families under zero-shot and 3-shot prompting via AWS Bedrock with 158K API calls. Our results reveal sharply task-dependent few-shot effects: few-shot prompting improves judgment form classification by up to +38.6 pp but has mixed effects on outcome prediction. We show that accuracy is misleading on imbalanced legal tasks: the model with the highest COP accuracy (62%) is a majority-class predictor (macro-F1: 23%), while the genuinely best model scores only 44% macro-F1. Within-family scaling analysis reveals that 8B models can match frontier performance on surface-level tasks but that scaling thresholds vary dramatically across families. We release all data, prompts, and model predictions.
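The gap between accuracy and macro-F1 described in the abstract can be illustrated with a minimal sketch. The class names and counts below are hypothetical and are not taken from the benchmark; they only mimic a six-class task in which one outcome covers 62% of cases. The `accuracy` and `macro_f1` helpers are written here for illustration and assume the common convention that a class with no predictions contributes an F1 of 0.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted label equals the gold label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in gold or predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # Assumption: a class that is never predicted correctly scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical class distribution (NOT the benchmark's actual label counts).
counts = {
    "outcome_a": 62,
    "outcome_b": 10,
    "outcome_c": 9,
    "outcome_d": 8,
    "outcome_e": 6,
    "outcome_f": 5,
}
y_true = [label for label, n in counts.items() for _ in range(n)]

# A degenerate "model" that always predicts the most frequent class.
majority_label = Counter(y_true).most_common(1)[0][0]
y_pred = [majority_label] * len(y_true)

acc = accuracy(y_true, y_pred)
mf1 = macro_f1(y_true, y_pred)
print(f"accuracy={acc:.3f}  macro-F1={mf1:.3f}")
```

On this synthetic data the majority-class predictor reaches 62% accuracy while its macro-F1 is roughly 13%, because five of the six classes score zero. The exact macro-F1 depends on the class distribution and on the model's actual predictions, so the figure will differ from the 23% reported in the abstract.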

