CL · AI · May 27

UA-Legal-Bench: A Benchmark for Evaluating Large Language Models on Ukrainian Legal Reasoning

arXiv:2605.29170 · 70.7 · h-index: 1

Predicted impact: top 90% in CL · last 90 days · Originality: Incremental advance

AI Analysis

For NLP and legal AI researchers, this benchmark exposes failure modes of LLMs in a morphologically rich, non-English legal domain and highlights the need for proper evaluation metrics on imbalanced tasks.

UA-Legal-Bench introduces a five-task benchmark for evaluating LLMs on Ukrainian legal reasoning. It finds that few-shot prompting improves judgment form classification by up to +38.6 percentage points but has mixed effects on outcome prediction. It also finds that accuracy is misleading on imbalanced tasks: the model with the highest case-outcome prediction (COP) accuracy, 62%, is a majority-class predictor with only 23% macro-F1.

Legal NLP benchmarks are overwhelmingly English-centric, leaving failure modes in morphologically rich, non-Latin-script languages undetected. We introduce UA-Legal-Bench, a five-task benchmark for evaluating large language models on Ukrainian legal reasoning, built from the Unified State Register of Court Decisions (EDRSR), one of the world's largest open judicial corpora (99.5 million decisions). The benchmark comprises: (1) case-type classification (4 classes, n=2,000), (2) judgment form classification (4 classes, n=2,000), (3) case-outcome prediction (COP; 6 classes, n=800), (4) legal norm extraction (n=1,794), and (5) cause category prediction (22 classes, n=1,871). We evaluate 11 LLMs (3B to 675B parameters) from five families under zero-shot and 3-shot prompting via AWS Bedrock with 158K API calls. Our results reveal sharply task-dependent few-shot effects: few-shot prompting improves judgment form classification by up to +38.6 pp but has mixed effects on outcome prediction. We show that accuracy is misleading on imbalanced legal tasks: the model with the highest COP accuracy (62%) is a majority-class predictor (macro-F1: 23%), while the genuinely best model scores only 44% macro-F1. Within-family scaling analysis reveals that 8B models can match frontier performance on surface-level tasks, but scaling thresholds vary dramatically across families. We release all data, prompts, and model predictions.
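
The gap between accuracy and macro-F1 on an imbalanced task can be illustrated with a minimal sketch. The label distribution below is a hypothetical toy example (six classes, one holding 62% of the items), not data from the benchmark, and the helper names `accuracy` and `macro_f1` are introduced here for illustration only. A predictor that always outputs the majority class reaches 62% accuracy, yet its macro-F1 is far lower because five of the six per-class F1 scores are zero. The toy numbers are not meant to reproduce the 23% reported in the abstract.

```python
def accuracy(y_true, y_pred):
    """Fraction of items whose predicted label equals the gold label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in gold or predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); a class that is never predicted scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical imbalanced distribution: 100 items, class "A" holds 62 of them.
toy_counts = {"A": 62, "B": 10, "C": 8, "D": 8, "E": 6, "F": 6}
y_true = [label for label, n in toy_counts.items() for _ in range(n)]

# A degenerate predictor that always outputs the majority class.
y_majority = ["A"] * len(y_true)

acc = accuracy(y_true, y_majority)
mf1 = macro_f1(y_true, y_majority)
print(f"accuracy = {acc:.3f}, macro-F1 = {mf1:.3f}")
```

In this toy setting only class "A" has a nonzero F1, so the macro average is that single score divided by six. This is why a high accuracy figure alone cannot distinguish a useful classifier from a majority-class predictor.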
